Diffstat (limited to 'docs')
-rw-r--r-- docs/COLO-FT.txt | 107
-rw-r--r-- docs/about/build-platforms.rst | 86
-rw-r--r-- docs/about/deprecated.rst | 477
-rw-r--r-- docs/about/emulation.rst | 184
-rw-r--r-- docs/about/index.rst | 17
-rw-r--r-- docs/about/license.rst | 2
-rw-r--r-- docs/about/removed-features.rst | 344
-rw-r--r-- docs/amd-memory-encryption.txt | 148
-rw-r--r-- docs/block-replication.txt | 58
-rw-r--r-- docs/ccid.txt | 182
-rw-r--r-- docs/colo-proxy.txt | 6
-rw-r--r-- docs/conf.py | 58
-rw-r--r-- docs/config/mach-virt-graphical.cfg | 4
-rw-r--r-- docs/config/mach-virt-serial.cfg | 4
-rw-r--r-- docs/config/q35-emulated.cfg | 6
-rw-r--r-- docs/config/q35-virtio-graphical.cfg | 6
-rw-r--r-- docs/config/q35-virtio-serial.cfg | 2
-rw-r--r-- docs/devel/acpi-bits.rst | 167
-rw-r--r-- docs/devel/atomics.rst | 84
-rw-r--r-- docs/devel/block-coroutine-wrapper.rst | 6
-rw-r--r-- docs/devel/build-system.rst | 500
-rw-r--r-- docs/devel/ci-definitions.rst.inc (renamed from docs/devel/ci-definitions.rst) | 2
-rw-r--r-- docs/devel/ci-jobs.rst | 51
-rw-r--r-- docs/devel/ci-jobs.rst.inc | 197
-rw-r--r-- docs/devel/ci-runners.rst.inc (renamed from docs/devel/ci-runners.rst) | 0
-rw-r--r-- docs/devel/ci.rst | 17
-rw-r--r-- docs/devel/clocks.rst | 6
-rw-r--r-- docs/devel/code-of-conduct.rst | 2
-rw-r--r-- docs/devel/decodetree.rst | 33
-rw-r--r-- docs/devel/docs.rst | 68
-rw-r--r-- docs/devel/fuzzing.rst | 35
-rw-r--r-- docs/devel/index-api.rst | 18
-rw-r--r-- docs/devel/index-build.rst | 20
-rw-r--r-- docs/devel/index-internals.rst | 22
-rw-r--r-- docs/devel/index-process.rst | 19
-rw-r--r-- docs/devel/index-tcg.rst | 19
-rw-r--r-- docs/devel/index.rst | 67
-rw-r--r-- docs/devel/kconfig.rst | 30
-rw-r--r-- docs/devel/loads-stores.rst | 128
-rw-r--r-- docs/devel/maintainers.rst | 107
-rw-r--r-- docs/devel/memory.rst | 14
-rw-r--r-- docs/devel/migration/CPR.rst | 147
-rw-r--r-- docs/devel/migration/best-practices.rst | 48
-rw-r--r-- docs/devel/migration/compatibility.rst | 517
-rw-r--r-- docs/devel/migration/dirty-limit.rst | 71
-rw-r--r-- docs/devel/migration/features.rst | 14
-rw-r--r-- docs/devel/migration/index.rst | 13
-rw-r--r-- docs/devel/migration/main.rst (renamed from docs/devel/migration.rst) | 377
-rw-r--r-- docs/devel/migration/mapped-ram.rst | 138
-rw-r--r-- docs/devel/migration/postcopy.rst | 313
-rw-r--r-- docs/devel/migration/vfio.rst | 208
-rw-r--r-- docs/devel/migration/virtio.rst | 115
-rw-r--r-- docs/devel/modules.rst | 2
-rw-r--r-- docs/devel/multi-process.rst | 33
-rw-r--r-- docs/devel/multi-thread-tcg.rst | 8
-rw-r--r-- docs/devel/multiple-iothreads.txt | 70
-rw-r--r-- docs/devel/nested-papr.txt | 119
-rw-r--r-- docs/devel/pci.rst | 8
-rw-r--r-- docs/devel/qapi-code-gen.rst | 315
-rw-r--r-- docs/devel/qdev-api.rst | 7
-rw-r--r-- docs/devel/qgraph.rst | 134
-rw-r--r-- docs/devel/qom-api.rst | 9
-rw-r--r-- docs/devel/qom.rst | 108
-rw-r--r-- docs/devel/qtest.rst | 5
-rw-r--r-- docs/devel/replay.rst | 306
-rw-r--r-- docs/devel/replay.txt | 46
-rw-r--r-- docs/devel/reset.rst | 96
-rw-r--r-- docs/devel/s390-cpu-topology.rst | 170
-rw-r--r-- docs/devel/stable-process.rst | 2
-rw-r--r-- docs/devel/style.rst | 154
-rw-r--r-- docs/devel/submitting-a-patch.rst | 593
-rw-r--r-- docs/devel/submitting-a-pull-request.rst | 73
-rw-r--r-- docs/devel/tcg-icount.rst | 6
-rw-r--r-- docs/devel/tcg-ops.rst | 979
-rw-r--r-- docs/devel/tcg-plugins.rst | 349
-rw-r--r-- docs/devel/tcg.rst | 27
-rw-r--r-- docs/devel/testing.rst | 521
-rw-r--r-- docs/devel/tracing.rst | 92
-rw-r--r-- docs/devel/trivial-patches.rst | 52
-rw-r--r-- docs/devel/ui.rst | 4
-rw-r--r-- docs/devel/vfio-iommufd.rst | 166
-rw-r--r-- docs/devel/vfio-migration.rst | 150
-rw-r--r-- docs/devel/virtio-backends.rst | 214
-rw-r--r-- docs/devel/virtio-migration.txt | 108
-rw-r--r-- docs/devel/writing-monitor-commands.rst | 648
-rw-r--r-- docs/devel/writing-qmp-commands.rst | 622
-rw-r--r-- docs/devel/zoned-storage.rst | 62
-rw-r--r-- docs/hyperv.txt | 222
-rw-r--r-- docs/image-fuzzer.txt | 6
-rw-r--r-- docs/interop/bitmaps.rst | 291
-rw-r--r-- docs/interop/dbus-display.rst | 31
-rw-r--r-- docs/interop/dbus-vmstate.rst | 52
-rw-r--r-- docs/interop/dbus.rst | 2
-rw-r--r-- docs/interop/firmware.json | 459
-rw-r--r-- docs/interop/index.rst | 4
-rw-r--r-- docs/interop/live-block-operations.rst | 55
-rw-r--r-- docs/interop/nbd.txt | 8
-rw-r--r-- docs/interop/prl-xml.txt | 2
-rw-r--r-- docs/interop/qcow2.txt | 19
-rw-r--r-- docs/interop/qemu-ga.rst | 13
-rw-r--r-- docs/interop/qemu-qmp-ref.rst | 2
-rw-r--r-- docs/interop/qmp-intro.txt | 88
-rw-r--r-- docs/interop/qmp-spec.rst (renamed from docs/interop/qmp-spec.txt) | 333
-rw-r--r-- docs/interop/vhost-user-gpu.rst | 62
-rw-r--r-- docs/interop/vhost-user.rst | 1029
-rw-r--r-- docs/interop/virtio-balloon-stats.rst (renamed from docs/virtio-balloon-stats.txt) | 58
-rw-r--r-- docs/interop/vnc-ledstate-pseudo-encoding.rst (renamed from docs/interop/vnc-ledstate-Pseudo-encoding.txt) | 0
-rw-r--r-- docs/meson.build | 46
-rw-r--r-- docs/multi-thread-compression.txt | 12
-rw-r--r-- docs/multiseat.txt | 2
-rw-r--r-- docs/papr-pef.txt | 30
-rw-r--r-- docs/pcie.txt | 16
-rw-r--r-- docs/pcie_sriov.txt | 112
-rw-r--r-- docs/pvrdma.txt | 345
-rw-r--r-- docs/qdev-device-use.txt | 4
-rw-r--r-- docs/qemu_logo.pdf | bin 9117 -> 0 bytes
-rw-r--r-- docs/rdma.txt | 2
-rw-r--r-- docs/replay.txt | 410
-rw-r--r-- docs/requirements.txt | 5
-rw-r--r-- docs/specs/acpi_erst.rst | 200
-rw-r--r-- docs/specs/edu.rst (renamed from docs/specs/edu.txt) | 86
-rw-r--r-- docs/specs/fsi.rst | 122
-rw-r--r-- docs/specs/fw_cfg.rst (renamed from docs/specs/fw_cfg.txt) | 211
-rw-r--r-- docs/specs/index.rst | 15
-rw-r--r-- docs/specs/ivshmem-spec.rst (renamed from docs/specs/ivshmem-spec.txt) | 63
-rw-r--r-- docs/specs/pci-ids.rst | 100
-rw-r--r-- docs/specs/pci-ids.txt | 71
-rw-r--r-- docs/specs/pci-serial.rst | 37
-rw-r--r-- docs/specs/pci-serial.txt | 34
-rw-r--r-- docs/specs/pci-testdev.rst | 39
-rw-r--r-- docs/specs/pci-testdev.txt | 31
-rw-r--r-- docs/specs/ppc-spapr-hcalls.rst | 99
-rw-r--r-- docs/specs/ppc-spapr-hcalls.txt | 78
-rw-r--r-- docs/specs/ppc-spapr-hotplug.rst | 510
-rw-r--r-- docs/specs/ppc-spapr-hotplug.txt | 409
-rw-r--r-- docs/specs/ppc-spapr-uv-hcalls.rst | 89
-rw-r--r-- docs/specs/ppc-spapr-uv-hcalls.txt | 76
-rw-r--r-- docs/specs/pvpanic.rst (renamed from docs/specs/pvpanic.txt) | 43
-rw-r--r-- docs/specs/sev-guest-firmware.rst | 125
-rw-r--r-- docs/specs/standard-vga.rst | 94
-rw-r--r-- docs/specs/standard-vga.txt | 81
-rw-r--r-- docs/specs/tpm.rst | 71
-rw-r--r-- docs/specs/virt-ctlr.rst (renamed from docs/specs/virt-ctlr.txt) | 12
-rw-r--r-- docs/specs/vmcoreinfo.rst | 54
-rw-r--r-- docs/specs/vmcoreinfo.txt | 53
-rw-r--r-- docs/specs/vmgenid.rst | 246
-rw-r--r-- docs/specs/vmgenid.txt | 245
-rw-r--r-- docs/specs/vmw_pvscsi-spec.rst | 115
-rw-r--r-- docs/specs/vmw_pvscsi-spec.txt | 92
-rw-r--r-- docs/sphinx-static/custom.js | 9
-rw-r--r-- docs/sphinx/dbusdoc.py | 166
-rw-r--r-- docs/sphinx/dbusdomain.py | 410
-rw-r--r-- docs/sphinx/dbusparser.py | 373
-rw-r--r-- docs/sphinx/depfile.py | 19
-rw-r--r-- docs/sphinx/fakedbusdoc.py | 30
-rw-r--r-- docs/sphinx/hxtool.py | 18
-rw-r--r-- docs/sphinx/kerneldoc.py | 5
-rw-r--r-- docs/sphinx/qapidoc.py | 37
-rw-r--r-- docs/sphinx/qmp_lexer.py | 5
-rw-r--r-- docs/system/arm/aspeed.rst | 195
-rw-r--r-- docs/system/arm/b-l475e-iot01a.rst | 45
-rw-r--r-- docs/system/arm/bananapi_m2u.rst | 140
-rw-r--r-- docs/system/arm/cpu-features.rst | 164
-rw-r--r-- docs/system/arm/cubieboard.rst | 2
-rw-r--r-- docs/system/arm/emulation.rst | 53
-rw-r--r-- docs/system/arm/mps2.rst | 37
-rw-r--r-- docs/system/arm/nuvoton.rst | 7
-rw-r--r-- docs/system/arm/orangepi.rst | 12
-rw-r--r-- docs/system/arm/palm.rst | 2
-rw-r--r-- docs/system/arm/raspi.rst | 15
-rw-r--r-- docs/system/arm/sbsa.rst | 86
-rw-r--r-- docs/system/arm/stm32.rst | 7
-rw-r--r-- docs/system/arm/vexpress.rst | 3
-rw-r--r-- docs/system/arm/virt.rst | 67
-rw-r--r-- docs/system/arm/xenpvh.rst | 39
-rw-r--r-- docs/system/arm/xlnx-versal-virt.rst | 80
-rw-r--r-- docs/system/arm/xscale.rst | 2
-rw-r--r-- docs/system/authz.rst | 26
-rw-r--r-- docs/system/confidential-guest-support.rst (renamed from docs/confidential-guest-support.txt) | 15
-rw-r--r-- docs/system/cpu-models-x86-abi.csv | 20
-rw-r--r-- docs/system/cpu-models-x86.rst.inc | 4
-rw-r--r-- docs/system/device-emulation.rst | 11
-rw-r--r-- docs/system/device-url-syntax.rst.inc | 6
-rw-r--r-- docs/system/devices/can.rst (renamed from docs/can.txt) | 103
-rw-r--r-- docs/system/devices/canokey.rst | 158
-rw-r--r-- docs/system/devices/ccid.rst | 171
-rw-r--r-- docs/system/devices/cxl.rst | 414
-rw-r--r-- docs/system/devices/igb.rst | 73
-rw-r--r-- docs/system/devices/ivshmem.rst | 4
-rw-r--r-- docs/system/devices/keyboard.rst | 129
-rw-r--r-- docs/system/devices/net.rst | 2
-rw-r--r-- docs/system/devices/nvme.rst | 166
-rw-r--r-- docs/system/devices/usb-u2f.rst | 93
-rw-r--r-- docs/system/devices/usb.rst | 65
-rw-r--r-- docs/system/devices/vhost-user-input.rst | 45
-rw-r--r-- docs/system/devices/vhost-user-rng.rst | 41
-rw-r--r-- docs/system/devices/vhost-user.rst | 76
-rw-r--r-- docs/system/devices/virtio-gpu.rst | 112
-rw-r--r-- docs/system/devices/virtio-snd.rst | 49
-rw-r--r-- docs/system/gdb.rst | 50
-rw-r--r-- docs/system/guest-loader.rst | 8
-rw-r--r-- docs/system/i386/amd-memory-encryption.rst | 206
-rw-r--r-- docs/system/i386/hyperv.rst | 288
-rw-r--r-- docs/system/i386/kvm-pv.rst | 100
-rw-r--r-- docs/system/i386/sgx.rst | 188
-rw-r--r-- docs/system/i386/xen.rst | 144
-rw-r--r-- docs/system/images.rst | 2
-rw-r--r-- docs/system/index.rst | 9
-rw-r--r-- docs/system/introduction.rst | 219
-rw-r--r-- docs/system/invocation.rst | 5
-rw-r--r-- docs/system/keys.rst | 2
-rw-r--r-- docs/system/keys.rst.inc | 13
-rw-r--r-- docs/system/linuxboot.rst | 2
-rw-r--r-- docs/system/loongarch/virt.rst | 108
-rw-r--r-- docs/system/multi-process.rst | 4
-rw-r--r-- docs/system/openrisc/cpu-features.rst | 15
-rw-r--r-- docs/system/openrisc/emulation.rst | 17
-rw-r--r-- docs/system/openrisc/or1k-sim.rst | 43
-rw-r--r-- docs/system/openrisc/virt.rst | 50
-rw-r--r-- docs/system/ppc/amigang.rst | 161
-rw-r--r-- docs/system/ppc/embedded.rst | 1
-rw-r--r-- docs/system/ppc/powernv.rst | 70
-rw-r--r-- docs/system/ppc/ppce500.rst | 43
-rw-r--r-- docs/system/ppc/pseries.rst | 292
-rw-r--r-- docs/system/qemu-block-drivers.rst.inc | 43
-rw-r--r-- docs/system/qemu-manpage.rst | 5
-rw-r--r-- docs/system/quickstart.rst | 21
-rw-r--r-- docs/system/replay.rst | 237
-rw-r--r-- docs/system/riscv/shakti-c.rst | 2
-rw-r--r-- docs/system/riscv/sifive_u.rst | 33
-rw-r--r-- docs/system/riscv/virt.rst | 86
-rw-r--r-- docs/system/s390x/bootdevices.rst | 28
-rw-r--r-- docs/system/s390x/cpu-topology.rst | 246
-rw-r--r-- docs/system/s390x/pcidevices.rst | 41
-rw-r--r-- docs/system/target-arm.rst | 3
-rw-r--r-- docs/system/target-i386-desc.rst.inc | 10
-rw-r--r-- docs/system/target-i386.rst | 9
-rw-r--r-- docs/system/target-mips.rst | 14
-rw-r--r-- docs/system/target-openrisc.rst | 71
-rw-r--r-- docs/system/target-ppc.rst | 1
-rw-r--r-- docs/system/target-riscv.rst | 24
-rw-r--r-- docs/system/target-s390x.rst | 2
-rw-r--r-- docs/system/target-sparc.rst | 2
-rw-r--r-- docs/system/targets.rst | 1
-rw-r--r-- docs/system/tls.rst | 4
-rw-r--r-- docs/system/vm-templating.rst | 125
-rw-r--r-- docs/throttle.txt | 8
-rw-r--r-- docs/tools/index.rst | 3
-rw-r--r-- docs/tools/qemu-img.rst | 55
-rw-r--r-- docs/tools/qemu-nbd.rst | 42
-rw-r--r-- docs/tools/qemu-pr-helper.rst | 4
-rw-r--r-- docs/tools/qemu-storage-daemon.rst | 47
-rw-r--r-- docs/tools/qemu-trace-stap.rst | 24
-rw-r--r-- docs/tools/virtfs-proxy-helper.rst | 3
-rw-r--r-- docs/tools/virtiofsd.rst | 360
-rw-r--r-- docs/u2f.txt | 110
-rw-r--r-- docs/user/index.rst | 2
-rw-r--r-- docs/user/main.rst | 20
258 files changed, 19060 insertions, 7220 deletions
diff --git a/docs/COLO-FT.txt b/docs/COLO-FT.txt
index 8d6d53a5a2..2e760a4aee 100644
--- a/docs/COLO-FT.txt
+++ b/docs/COLO-FT.txt
@@ -209,9 +209,10 @@ children.0=childs0 \
3. On Secondary VM's QEMU monitor, issue command
-{'execute':'qmp_capabilities'}
-{'execute': 'nbd-server-start', 'arguments': {'addr': {'type': 'inet', 'data': {'host': '0.0.0.0', 'port': '9999'} } } }
-{'execute': 'nbd-server-add', 'arguments': {'device': 'parent0', 'writable': true } }
+{"execute":"qmp_capabilities"}
+{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [ {"capability": "x-colo", "state": true } ] } }
+{"execute": "nbd-server-start", "arguments": {"addr": {"type": "inet", "data": {"host": "0.0.0.0", "port": "9999"} } } }
+{"execute": "nbd-server-add", "arguments": {"device": "parent0", "writable": true } }
Note:
a. The qmp command nbd-server-start and nbd-server-add must be run
@@ -222,11 +223,11 @@ Note:
will be merged into the parent disk on failover.
4. On Primary VM's QEMU monitor, issue command:
-{'execute':'qmp_capabilities'}
-{'execute': 'human-monitor-command', 'arguments': {'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0'}}
-{'execute': 'x-blockdev-change', 'arguments':{'parent': 'colo-disk0', 'node': 'replication0' } }
-{'execute': 'migrate-set-capabilities', 'arguments': {'capabilities': [ {'capability': 'x-colo', 'state': true } ] } }
-{'execute': 'migrate', 'arguments': {'uri': 'tcp:127.0.0.2:9998' } }
+{"execute":"qmp_capabilities"}
+{"execute": "human-monitor-command", "arguments": {"command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0"}}
+{"execute": "x-blockdev-change", "arguments":{"parent": "colo-disk0", "node": "replication0" } }
+{"execute": "migrate-set-capabilities", "arguments": {"capabilities": [ {"capability": "x-colo", "state": true } ] } }
+{"execute": "migrate", "arguments": {"uri": "tcp:127.0.0.2:9998" } }
Note:
a. There should be only one NBD Client for each primary disk.
@@ -249,59 +250,59 @@ if you want to resume the replication, follow "Secondary resume replication"
== Primary Failover ==
The Secondary died, resume on the Primary
-{'execute': 'x-blockdev-change', 'arguments':{ 'parent': 'colo-disk0', 'child': 'children.1'} }
-{'execute': 'human-monitor-command', 'arguments':{ 'command-line': 'drive_del replication0' } }
-{'execute': 'object-del', 'arguments':{ 'id': 'comp0' } }
-{'execute': 'object-del', 'arguments':{ 'id': 'iothread1' } }
-{'execute': 'object-del', 'arguments':{ 'id': 'm0' } }
-{'execute': 'object-del', 'arguments':{ 'id': 'redire0' } }
-{'execute': 'object-del', 'arguments':{ 'id': 'redire1' } }
-{'execute': 'x-colo-lost-heartbeat' }
+{"execute": "x-blockdev-change", "arguments":{ "parent": "colo-disk0", "child": "children.1"} }
+{"execute": "human-monitor-command", "arguments":{ "command-line": "drive_del replication0" } }
+{"execute": "object-del", "arguments":{ "id": "comp0" } }
+{"execute": "object-del", "arguments":{ "id": "iothread1" } }
+{"execute": "object-del", "arguments":{ "id": "m0" } }
+{"execute": "object-del", "arguments":{ "id": "redire0" } }
+{"execute": "object-del", "arguments":{ "id": "redire1" } }
+{"execute": "x-colo-lost-heartbeat" }
== Secondary Failover ==
The Primary died, resume on the Secondary and prepare to become the new Primary
-{'execute': 'nbd-server-stop'}
-{'execute': 'x-colo-lost-heartbeat'}
+{"execute": "nbd-server-stop"}
+{"execute": "x-colo-lost-heartbeat"}
-{'execute': 'object-del', 'arguments':{ 'id': 'f2' } }
-{'execute': 'object-del', 'arguments':{ 'id': 'f1' } }
-{'execute': 'chardev-remove', 'arguments':{ 'id': 'red1' } }
-{'execute': 'chardev-remove', 'arguments':{ 'id': 'red0' } }
+{"execute": "object-del", "arguments":{ "id": "f2" } }
+{"execute": "object-del", "arguments":{ "id": "f1" } }
+{"execute": "chardev-remove", "arguments":{ "id": "red1" } }
+{"execute": "chardev-remove", "arguments":{ "id": "red0" } }
-{'execute': 'chardev-add', 'arguments':{ 'id': 'mirror0', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '0.0.0.0', 'port': '9003' } }, 'server': true } } } }
-{'execute': 'chardev-add', 'arguments':{ 'id': 'compare1', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '0.0.0.0', 'port': '9004' } }, 'server': true } } } }
-{'execute': 'chardev-add', 'arguments':{ 'id': 'compare0', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '127.0.0.1', 'port': '9001' } }, 'server': true } } } }
-{'execute': 'chardev-add', 'arguments':{ 'id': 'compare0-0', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '127.0.0.1', 'port': '9001' } }, 'server': false } } } }
-{'execute': 'chardev-add', 'arguments':{ 'id': 'compare_out', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '127.0.0.1', 'port': '9005' } }, 'server': true } } } }
-{'execute': 'chardev-add', 'arguments':{ 'id': 'compare_out0', 'backend': {'type': 'socket', 'data': {'addr': { 'type': 'inet', 'data': { 'host': '127.0.0.1', 'port': '9005' } }, 'server': false } } } }
+{"execute": "chardev-add", "arguments":{ "id": "mirror0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "0.0.0.0", "port": "9003" } }, "server": true } } } }
+{"execute": "chardev-add", "arguments":{ "id": "compare1", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "0.0.0.0", "port": "9004" } }, "server": true } } } }
+{"execute": "chardev-add", "arguments":{ "id": "compare0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9001" } }, "server": true } } } }
+{"execute": "chardev-add", "arguments":{ "id": "compare0-0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9001" } }, "server": false } } } }
+{"execute": "chardev-add", "arguments":{ "id": "compare_out", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9005" } }, "server": true } } } }
+{"execute": "chardev-add", "arguments":{ "id": "compare_out0", "backend": {"type": "socket", "data": {"addr": { "type": "inet", "data": { "host": "127.0.0.1", "port": "9005" } }, "server": false } } } }
== Primary resume replication ==
Resume replication after new Secondary is up.
Start the new Secondary (Steps 2 and 3 above), then on the Primary:
-{'execute': 'drive-mirror', 'arguments':{ 'device': 'colo-disk0', 'job-id': 'resync', 'target': 'nbd://127.0.0.2:9999/parent0', 'mode': 'existing', 'format': 'raw', 'sync': 'full'} }
+{"execute": "drive-mirror", "arguments":{ "device": "colo-disk0", "job-id": "resync", "target": "nbd://127.0.0.2:9999/parent0", "mode": "existing", "format": "raw", "sync": "full"} }
Wait until disk is synced, then:
-{'execute': 'stop'}
-{'execute': 'block-job-cancel', 'arguments':{ 'device': 'resync'} }
+{"execute": "stop"}
+{"execute": "block-job-cancel", "arguments":{ "device": "resync"} }
-{'execute': 'human-monitor-command', 'arguments':{ 'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0'}}
-{'execute': 'x-blockdev-change', 'arguments':{ 'parent': 'colo-disk0', 'node': 'replication0' } }
+{"execute": "human-monitor-command", "arguments":{ "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.2,file.port=9999,file.export=parent0,node-name=replication0"}}
+{"execute": "x-blockdev-change", "arguments":{ "parent": "colo-disk0", "node": "replication0" } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-mirror', 'id': 'm0', 'props': { 'netdev': 'hn0', 'queue': 'tx', 'outdev': 'mirror0' } } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-redirector', 'id': 'redire0', 'props': { 'netdev': 'hn0', 'queue': 'rx', 'indev': 'compare_out' } } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-redirector', 'id': 'redire1', 'props': { 'netdev': 'hn0', 'queue': 'rx', 'outdev': 'compare0' } } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'iothread', 'id': 'iothread1' } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'colo-compare', 'id': 'comp0', 'props': { 'primary_in': 'compare0-0', 'secondary_in': 'compare1', 'outdev': 'compare_out0', 'iothread': 'iothread1' } } }
+{"execute": "object-add", "arguments":{ "qom-type": "filter-mirror", "id": "m0", "netdev": "hn0", "queue": "tx", "outdev": "mirror0" } }
+{"execute": "object-add", "arguments":{ "qom-type": "filter-redirector", "id": "redire0", "netdev": "hn0", "queue": "rx", "indev": "compare_out" } }
+{"execute": "object-add", "arguments":{ "qom-type": "filter-redirector", "id": "redire1", "netdev": "hn0", "queue": "rx", "outdev": "compare0" } }
+{"execute": "object-add", "arguments":{ "qom-type": "iothread", "id": "iothread1" } }
+{"execute": "object-add", "arguments":{ "qom-type": "colo-compare", "id": "comp0", "primary_in": "compare0-0", "secondary_in": "compare1", "outdev": "compare_out0", "iothread": "iothread1" } }
-{'execute': 'migrate-set-capabilities', 'arguments':{ 'capabilities': [ {'capability': 'x-colo', 'state': true } ] } }
-{'execute': 'migrate', 'arguments':{ 'uri': 'tcp:127.0.0.2:9998' } }
+{"execute": "migrate-set-capabilities", "arguments":{ "capabilities": [ {"capability": "x-colo", "state": true } ] } }
+{"execute": "migrate", "arguments":{ "uri": "tcp:127.0.0.2:9998" } }
Note:
If this Primary previously was a Secondary, then we need to insert the
filters before the filter-rewriter by using the
-"'insert': 'before', 'position': 'id=rew0'" Options. See below.
+'"insert": "before", "position": "id=rew0"' options. See below.
== Secondary resume replication ==
Become Primary and resume replication after new Secondary is up. Note
@@ -309,23 +310,23 @@ that now 127.0.0.1 is the Secondary and 127.0.0.2 is the Primary.
Start the new Secondary (Steps 2 and 3 above, but with primary_ip=127.0.0.2),
then on the old Secondary:
-{'execute': 'drive-mirror', 'arguments':{ 'device': 'colo-disk0', 'job-id': 'resync', 'target': 'nbd://127.0.0.1:9999/parent0', 'mode': 'existing', 'format': 'raw', 'sync': 'full'} }
+{"execute": "drive-mirror", "arguments":{ "device": "colo-disk0", "job-id": "resync", "target": "nbd://127.0.0.1:9999/parent0", "mode": "existing", "format": "raw", "sync": "full"} }
Wait until disk is synced, then:
-{'execute': 'stop'}
-{'execute': 'block-job-cancel', 'arguments':{ 'device': 'resync' } }
+{"execute": "stop"}
+{"execute": "block-job-cancel", "arguments":{ "device": "resync" } }
-{'execute': 'human-monitor-command', 'arguments':{ 'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.1,file.port=9999,file.export=parent0,node-name=replication0'}}
-{'execute': 'x-blockdev-change', 'arguments':{ 'parent': 'colo-disk0', 'node': 'replication0' } }
+{"execute": "human-monitor-command", "arguments":{ "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=127.0.0.1,file.port=9999,file.export=parent0,node-name=replication0"}}
+{"execute": "x-blockdev-change", "arguments":{ "parent": "colo-disk0", "node": "replication0" } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-mirror', 'id': 'm0', 'props': { 'insert': 'before', 'position': 'id=rew0', 'netdev': 'hn0', 'queue': 'tx', 'outdev': 'mirror0' } } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-redirector', 'id': 'redire0', 'props': { 'insert': 'before', 'position': 'id=rew0', 'netdev': 'hn0', 'queue': 'rx', 'indev': 'compare_out' } } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'filter-redirector', 'id': 'redire1', 'props': { 'insert': 'before', 'position': 'id=rew0', 'netdev': 'hn0', 'queue': 'rx', 'outdev': 'compare0' } } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'iothread', 'id': 'iothread1' } }
-{'execute': 'object-add', 'arguments':{ 'qom-type': 'colo-compare', 'id': 'comp0', 'props': { 'primary_in': 'compare0-0', 'secondary_in': 'compare1', 'outdev': 'compare_out0', 'iothread': 'iothread1' } } }
+{"execute": "object-add", "arguments":{ "qom-type": "filter-mirror", "id": "m0", "insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "tx", "outdev": "mirror0" } }
+{"execute": "object-add", "arguments":{ "qom-type": "filter-redirector", "id": "redire0", "insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "rx", "indev": "compare_out" } }
+{"execute": "object-add", "arguments":{ "qom-type": "filter-redirector", "id": "redire1", "insert": "before", "position": "id=rew0", "netdev": "hn0", "queue": "rx", "outdev": "compare0" } }
+{"execute": "object-add", "arguments":{ "qom-type": "iothread", "id": "iothread1" } }
+{"execute": "object-add", "arguments":{ "qom-type": "colo-compare", "id": "comp0", "primary_in": "compare0-0", "secondary_in": "compare1", "outdev": "compare_out0", "iothread": "iothread1" } }
-{'execute': 'migrate-set-capabilities', 'arguments':{ 'capabilities': [ {'capability': 'x-colo', 'state': true } ] } }
-{'execute': 'migrate', 'arguments':{ 'uri': 'tcp:127.0.0.1:9998' } }
+{"execute": "migrate-set-capabilities", "arguments":{ "capabilities": [ {"capability": "x-colo", "state": true } ] } }
+{"execute": "migrate", "arguments":{ "uri": "tcp:127.0.0.1:9998" } }
== TODO ==
1. Support shared storage.
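
One visible effect of the quoting change in this file can be checked with any strict JSON parser. The sketch below is illustrative only: it uses Python's standard `json` module, and the helper names `is_strict_json` and `flatten_props` are hypothetical, not part of QEMU. It shows that the old single-quoted spelling is rejected as strict JSON while the new spelling parses, and how an old-style `object-add` with a nested `props` member maps onto the flat form used in the updated examples.

```python
import json

# Old and new spellings of the same command, copied from the diff above.
OLD_STYLE = "{'execute': 'nbd-server-stop'}"
NEW_STYLE = '{"execute": "nbd-server-stop"}'


def is_strict_json(text):
    """Return True if text parses with Python's strict JSON parser."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


def flatten_props(command):
    """Hypothetical helper: move members of 'props' up into 'arguments'.

    Mirrors the object-add change shown in this diff. Returns a new
    dict and leaves the input untouched.
    """
    if "arguments" not in command:
        return dict(command)
    args = dict(command["arguments"])
    props = args.pop("props", {})
    args.update(props)
    return {**command, "arguments": args}


# Old-style object-add, as in the removed lines of the diff.
old_object_add = {
    "execute": "object-add",
    "arguments": {
        "qom-type": "filter-mirror",
        "id": "m0",
        "props": {"netdev": "hn0", "queue": "tx", "outdev": "mirror0"},
    },
}

new_object_add = flatten_props(old_object_add)
print(is_strict_json(OLD_STYLE), is_strict_json(NEW_STYLE))
print(json.dumps(new_object_add))
```

The second printed line should match the flat `filter-mirror` command in the "Primary resume replication" hunk, apart from whitespace.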
diff --git a/docs/about/build-platforms.rst b/docs/about/build-platforms.rst
index bcb1549721..8fd7da140a 100644
--- a/docs/about/build-platforms.rst
+++ b/docs/about/build-platforms.rst
@@ -41,23 +41,25 @@ Those hosts are officially supported, with various accelerators:
- Accelerators
* - Arm
- kvm (64 bit only), tcg, xen
- * - MIPS
+ * - MIPS (little endian only)
- kvm, tcg
* - PPC
- kvm, tcg
* - RISC-V
- - tcg
+ - kvm, tcg
* - s390x
- kvm, tcg
* - SPARC
- tcg
* - x86
- - hax, hvf (64 bit only), kvm, nvmm, tcg, whpx (64 bit only), xen
+ - hvf (64 bit only), kvm, nvmm, tcg, whpx (64 bit only), xen
-Other host architectures are not supported. It is possible to build QEMU on an
-unsupported host architecture using the configure ``--enable-tcg-interpreter``
-option to enable the experimental TCI support, but note that this is very slow
-and is not recommended.
+Other host architectures are not supported. It is possible to build QEMU system
+emulation on an unsupported host architecture using the configure
+``--enable-tcg-interpreter`` option to enable the TCI support, but note that
+this is very slow and is not recommended for normal use. QEMU user emulation
+requires host-specific support for signal handling, therefore TCI won't help
+on unsupported host architectures.
Non-supported architectures may be removed in the future following the
:ref:`deprecation process<Deprecated features>`.
@@ -65,11 +67,15 @@ Non-supported architectures may be removed in the future following the
Linux OS, macOS, FreeBSD, NetBSD, OpenBSD
-----------------------------------------
-The project aims to support the most recent major version at all times. Support
+The project aims to support the most recent major version at all times for
+up to five years after its initial release. Support
for the previous major version will be dropped 2 years after the new major
version is released or when the vendor itself drops support, whichever comes
first. In this context, third-party efforts to extend the lifetime of a distro
-are not considered, even when they are endorsed by the vendor (eg. Debian LTS).
+are not considered, even when they are endorsed by the vendor (e.g. Debian LTS);
+the same is true of repositories that contain packages backported from later
+releases (e.g. Debian backports). Within each major release, only the most
+recent minor release is considered.
For the purposes of identifying supported software versions available on Linux,
the project will look at CentOS, Debian, Fedora, openSUSE, RHEL, SLES and
@@ -78,18 +84,64 @@ Ubuntu LTS. Other distros will be assumed to ship similar software versions.
For FreeBSD and OpenBSD, decisions will be made based on the contents of the
respective ports repository, while NetBSD will use the pkgsrc repository.
-For macOS, `HomeBrew`_ will be used, although `MacPorts`_ is expected to carry
+For macOS, `Homebrew`_ will be used, although `MacPorts`_ is expected to carry
similar versions.
-Windows
--------
+Some build dependencies may follow less conservative rules:
+
+Python runtime
+ Distributions with long-term support often provide multiple versions
+ of the Python runtime. While QEMU will initially aim to support the
+ distribution's default runtime, it may later increase its minimum version
+ to any newer Python that is available as an option from the vendor.
+ In this case, it will be necessary to use the ``--python`` command line
+ option of the ``configure`` script to point QEMU to a supported
+ version of the Python runtime.
+
+ As of QEMU |version|, the minimum supported version of Python is 3.7.
+
+Python build dependencies
+ Some of QEMU's build dependencies are written in Python. Usually these
+ are only packaged by distributions for the default Python runtime.
+ If QEMU bumps its minimum Python version and a non-default runtime is
+ required, it may be necessary to fetch Python modules from the Python
+ Package Index (PyPI) via ``pip``, in order to build QEMU.
+
+Optional build dependencies
+ Build components whose absence does not affect the ability to build
+ QEMU may not be available in distros, or may be too old for QEMU's
+ requirements. Many of these, such as the Avocado testing framework
+ or various linters, are written in Python and therefore can also
+ be installed using ``pip``. Cross compilers are another example
+ of optional build-time dependency; in this case it is possible to
+ download them from repositories such as EPEL, to use container-based
+ cross compilation using ``docker`` or ``podman``, or to use pre-built
+ binaries distributed with QEMU.
-The project supports building with current versions of the MinGW toolchain,
-hosted on Linux (Debian/Fedora).
-The version of the Windows API that's currently targeted is Vista / Server
-2008.
+Windows
+-------
-.. _HomeBrew: https://brew.sh/
+The project aims to support the two most recent versions of Windows that are
+still supported by the vendor. The minimum Windows API that is currently
+targeted is "Windows 8", so theoretically the QEMU binaries can still be run
+on older versions of Windows, too. However, such old versions of Windows are
+not tested anymore, so it is recommended to use one of the latest versions of
+Windows instead.
+
+The project supports building QEMU with current versions of the MinGW
+toolchain, either hosted on Linux (Debian/Fedora) or via `MSYS2`_ on Windows.
+A more recent Windows version is always preferred as it is less likely to have
+problems with building via MSYS2. The QEMU build process involves some
+Python scripts that call ``os.symlink()``, which needs special attention for
+the build to complete successfully. On newer versions of Windows 10,
+unprivileged accounts can create symlinks if Developer Mode is enabled.
+When Developer Mode is not available/enabled, the SeCreateSymbolicLinkPrivilege
+privilege is required, or the process must be run as an administrator.
+
+Only 64-bit Windows is supported.
+
+.. _Homebrew: https://brew.sh/
.. _MacPorts: https://www.macports.org/
+.. _MSYS2: https://www.msys2.org/
.. _Repology: https://repology.org/
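
Two of the host requirements stated in this file can be probed before starting a build. The following is a minimal sketch, not part of the QEMU build system: the function names are hypothetical, the 3.7 minimum is taken from the text above, and the symlink probe simply attempts `os.symlink()` in a temporary directory, which is the call the Windows section says needs Developer Mode, `SeCreateSymbolicLinkPrivilege`, or an administrator account.

```python
import os
import sys
import tempfile

# Minimum Python version quoted in the updated build-platforms text.
MIN_PYTHON = (3, 7)


def python_is_supported(version=None, minimum=MIN_PYTHON):
    """Return True if the given (or running) interpreter meets the minimum."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version) >= tuple(minimum)


def can_create_symlinks():
    """Return True if this process can create a symlink in a temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as handle:
            handle.write("probe\n")
        try:
            os.symlink(target, link)
        except (OSError, NotImplementedError):
            # Typical on Windows without Developer Mode or the privilege.
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    print("python ok:", python_is_supported())
    print("symlinks ok:", can_create_symlinks())
```

If the first check fails, the text above suggests pointing ``configure`` at a newer interpreter with ``--python``; if the second fails on Windows, enabling Developer Mode or running with the required privilege is the documented remedy.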
diff --git a/docs/about/deprecated.rst b/docs/about/deprecated.rst
index 3c2be84d80..7b8aafa15b 100644
--- a/docs/about/deprecated.rst
+++ b/docs/about/deprecated.rst
@@ -23,37 +23,6 @@ deprecated.
System emulator command line arguments
--------------------------------------
-``QEMU_AUDIO_`` environment variables and ``-audio-help`` (since 4.0)
-'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-The ``-audiodev`` argument is now the preferred way to specify audio
-backend settings instead of environment variables. To ease migration to
-the new format, the ``-audiodev-help`` option can be used to convert
-the current values of the environment variables to ``-audiodev`` options.
-
-Creating sound card devices and vnc without ``audiodev=`` property (since 4.2)
-''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-When not using the deprecated legacy audio config, each sound card
-should specify an ``audiodev=`` property. Additionally, when using
-vnc, you should specify an ``audiodev=`` property if you plan to
-transmit audio through the VNC protocol.
-
-Creating sound card devices using ``-soundhw`` (since 5.1)
-''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-Sound card devices should be created using ``-device`` instead. The
-names are the same for most devices. The exceptions are ``hda`` which
-needs two devices (``-device intel-hda -device hda-duplex``) and
-``pcspk`` which can be activated using ``-machine
-pcspk-audiodev=<name>``.
-
-``-chardev`` backend aliases ``tty`` and ``parport`` (since 6.0)
-''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-``tty`` and ``parport`` are aliases that will be removed. Instead, the
-actual backend names ``serial`` and ``parallel`` should be used.
-
Short-form boolean options (since 6.0)
''''''''''''''''''''''''''''''''''''''
@@ -67,100 +36,6 @@ and will cause a warning.
The replacement for the ``nodelay`` short-form boolean option is ``nodelay=on``
rather than ``delay=off``.
-``--enable-fips`` (since 6.0)
-'''''''''''''''''''''''''''''
-
-This option restricts usage of certain cryptographic algorithms when
-the host is operating in FIPS mode.
-
-If FIPS compliance is required, QEMU should be built with the ``libgcrypt``
-library enabled as a cryptography provider.
-
-Neither the ``nettle`` library, or the built-in cryptography provider are
-supported on FIPS enabled hosts.
-
-``-writeconfig`` (since 6.0)
-'''''''''''''''''''''''''''''
-
-The ``-writeconfig`` option is not able to serialize the entire contents
-of the QEMU command line. It is thus considered a failed experiment
-and deprecated, with no current replacement.
-
-Userspace local APIC with KVM (x86, since 6.0)
-''''''''''''''''''''''''''''''''''''''''''''''
-
-Using ``-M kernel-irqchip=off`` with x86 machine types that include a local
-APIC is deprecated. The ``split`` setting is supported, as is using
-``-M kernel-irqchip=off`` with the ISA PC machine type.
-
-hexadecimal sizes with scaling multipliers (since 6.0)
-''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-Input parameters that take a size value should only use a size suffix
-(such as 'k' or 'M') when the base is written in decimal, and not when
-the value is hexadecimal. That is, '0x20M' is deprecated, and should
-be written either as '32M' or as '0x2000000'.
-
-``-spice password=string`` (since 6.0)
-''''''''''''''''''''''''''''''''''''''
-
-This option is insecure because the SPICE password remains visible in
-the process listing. This is replaced by the new ``password-secret``
-option which lets the password be securely provided on the command
-line using a ``secret`` object instance.
-
-``opened`` property of ``rng-*`` objects (since 6.0)
-''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-The only effect of specifying ``opened=on`` in the command line or QMP
-``object-add`` is that the device is opened immediately, possibly before all
-other options have been processed. This will either have no effect (if
-``opened`` was the last option) or cause errors. The property is therefore
-useless and should not be specified.
-
-``loaded`` property of ``secret`` and ``secret_keyring`` objects (since 6.0)
-''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-The only effect of specifying ``loaded=on`` in the command line or QMP
-``object-add`` is that the secret is loaded immediately, possibly before all
-other options have been processed. This will either have no effect (if
-``loaded`` was the last option) or cause options to be effectively ignored as
-if they were not given. The property is therefore useless and should not be
-specified.
-
-``-display sdl,window_close=...`` (since 6.1)
-'''''''''''''''''''''''''''''''''''''''''''''
-
-Use ``-display sdl,window-close=...`` instead (i.e. with a minus instead of
-an underscore between "window" and "close").
-
-``-no-quit`` (since 6.1)
-''''''''''''''''''''''''
-
-The ``-no-quit`` is a synonym for ``-display ...,window-close=off`` which
-should be used instead.
-
-``-alt-grab`` and ``-display sdl,alt_grab=on`` (since 6.2)
-''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-Use ``-display sdl,grab-mod=lshift-lctrl-lalt`` instead.
-
-``-ctrl-grab`` and ``-display sdl,ctrl_grab=on`` (since 6.2)
-''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-Use ``-display sdl,grab-mod=rctrl`` instead.
-
-``-sdl`` (since 6.2)
-''''''''''''''''''''
-
-Use ``-display sdl`` instead.
-
-``-curses`` (since 6.2)
-'''''''''''''''''''''''
-
-Use ``-display curses`` instead.
-
-
Plugin argument passing through ``arg=<string>`` (since 6.1)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
@@ -172,6 +47,29 @@ as short-form boolean values, and passed to plugins as ``arg_name=on``.
However, short-form booleans are deprecated and full explicit ``arg_name=on``
form is preferred.
+``-smp`` (Unsupported "parameter=1" SMP configurations) (since 9.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Specified CPU topology parameters must be supported by the machine.
+
+In the SMP configuration, users should provide the CPU topology parameters that
+are supported by the target machine.
+
+However, users were historically allowed to specify an unsupported
+topology parameter as "1", which is meaningless. Support for this kind of
+configuration (e.g. -smp drawers=1,books=1,clusters=1 for x86 PC machine) is
+deprecated since 9.0; users must ensure that all the topology members
+described with -smp are supported by the target machine.
+
+User-mode emulator command line arguments
+-----------------------------------------
+
+``-p`` (since 9.0)
+''''''''''''''''''
+
+The ``-p`` option pretends to control the host page size. However,
+it is not possible to change the host page size, and using the
+option only causes failures.
QEMU Machine Protocol (QMP) commands
------------------------------------
@@ -213,40 +111,143 @@ Use the more generic commands ``block-export-add`` and ``block-export-del``
instead. As part of this deprecation, where ``nbd-server-add`` used a
single ``bitmap``, the new ``block-export-add`` uses a list of ``bitmaps``.
-System accelerators
--------------------
+``query-qmp-schema`` return value member ``values`` (since 6.2)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Member ``values`` in return value elements with meta-type ``enum`` is
+deprecated. Use ``members`` instead.
-MIPS ``Trap-and-Emul`` KVM support (since 6.0)
-''''''''''''''''''''''''''''''''''''''''''''''
+``drive-backup`` (since 6.2)
+''''''''''''''''''''''''''''
-The MIPS ``Trap-and-Emul`` KVM host and guest support has been removed
-from Linux upstream kernel, declare it deprecated.
+Use ``blockdev-backup`` in combination with ``blockdev-add`` instead.
+This change primarily separates the creation/opening process of the backup
+target into explicit, separate steps. ``blockdev-backup`` uses mostly the
+same arguments as ``drive-backup``, except the ``format`` and ``mode``
+options are removed in favor of using explicit ``blockdev-create`` and
+``blockdev-add`` calls. See :doc:`/interop/live-block-operations` for
+details.
-System emulator CPUS
+Incorrectly typed ``device_add`` arguments (since 6.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Due to shortcomings in the internal implementation of ``device_add``, QEMU
+incorrectly accepts certain invalid arguments: Any object or list arguments are
+silently ignored. Other argument types are not checked, but an implicit
+conversion happens, so that e.g. string values can be assigned to integer
+device properties or vice versa.
+
+This is a bug in QEMU that will be fixed in the future so that previously
+accepted incorrect commands will return an error. Users should make sure that
+all arguments passed to ``device_add`` are consistent with the documented
+property types.
+
+QEMU Machine Protocol (QMP) events
+----------------------------------
+
+``MEM_UNPLUG_ERROR`` (since 6.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use the more generic event ``DEVICE_UNPLUG_GUEST_ERROR`` instead.
+
+``vcpu`` trace events (since 8.1)
+'''''''''''''''''''''''''''''''''
+
+The ability to instrument QEMU helper functions with vCPU-aware trace
+points was removed in 7.0. However, QMP still exposed the ``vcpu``
+parameter. This argument has now been deprecated and the
+remaining trace points that used it are selected just by name.
+
+Host Architectures
+------------------
+
+BE MIPS (since 7.2)
+'''''''''''''''''''
+
+As Debian 10 ("Buster") moved into LTS, the big endian 32 bit version of
+MIPS moved out of support, making it hard to maintain our
+cross-compilation CI tests of the architecture. As we no longer have
+CI coverage, support may bitrot away before the deprecation process
+completes. The little endian variants of MIPS (both 32 and 64 bit) are
+still a supported host architecture.
+
+System emulation on 32-bit x86 hosts (since 8.0)
+''''''''''''''''''''''''''''''''''''''''''''''''
+
+Support for 32-bit x86 host deployments is increasingly uncommon in mainstream
+OS distributions given the widespread availability of 64-bit x86 hardware.
+The QEMU project no longer considers 32-bit x86 support for system emulation to
+be an effective use of its limited resources, and thus intends to discontinue
+it. Since all recent x86 hardware from the past >10 years is capable of the
+64-bit x86 extensions, a corresponding 64-bit OS should be used instead.
+
+
+System emulator CPUs
--------------------
-``Icelake-Client`` CPU Model (since 5.2)
-''''''''''''''''''''''''''''''''''''''''
+``power5+`` and ``power7+`` CPU names (since 9.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''
-``Icelake-Client`` CPU Models are deprecated. Use ``Icelake-Server`` CPU
-Models instead.
+The character "+" in device (and thus also CPU) names is not allowed
+in the QEMU object model anymore. ``power5+``, ``power5+_v2.1``,
+``power7+`` and ``power7+_v2.1`` are currently still supported via
+an alias, but for consistency these will get removed in a future
+release, too. Use ``power5p_v2.1`` and ``power7p_v2.1`` instead.
-MIPS ``I7200`` CPU Model (since 5.2)
-''''''''''''''''''''''''''''''''''''
+CRIS CPU architecture (since 9.0)
+'''''''''''''''''''''''''''''''''
-The ``I7200`` guest CPU relies on the nanoMIPS ISA, which is deprecated
-(the ISA has never been upstreamed to a compiler toolchain). Therefore
-this CPU is also deprecated.
+The CRIS architecture was pulled from Linux in 4.17 and the compiler
+is no longer packaged in any distro making it harder to run the
+``check-tcg`` tests. Unless we can improve the testing situation there
+is a chance the code will bitrot without anyone noticing.
System emulator machines
------------------------
-Aspeed ``swift-bmc`` machine (since 6.1)
-''''''''''''''''''''''''''''''''''''''''
+Arm ``virt`` machine ``dtb-kaslr-seed`` property (since 7.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``dtb-kaslr-seed`` property on the ``virt`` board has been
+deprecated; use the new name ``dtb-randomness`` instead. The new name
+better reflects the way this property affects all random data within
+the device tree blob, not just the ``kaslr-seed`` node.
+
+``pc-i440fx-2.0`` up to ``pc-i440fx-2.3`` (since 8.2)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+These old machine types are quite neglected nowadays and thus might have
+various pitfalls with regards to live migration. Use a newer machine type
+instead.
+
+``shix`` (since 9.0)
+''''''''''''''''''''
-This machine is deprecated because we have enough AST2500 based OpenPOWER
-machines. It can be easily replaced by the ``witherspoon-bmc`` or the
-``romulus-bmc`` machines.
+The machine is no longer in existence and has been long unmaintained
+in QEMU. This also holds for the TC51828 16MiB flash that it uses.
+
+``pseries-2.1`` up to ``pseries-2.12`` (since 9.0)
+''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Older pseries machines before version 3.0 have undergone many changes
+to correct issues, mostly regarding migration compatibility. These are
+no longer maintained and removing them will make the code easier to
+read and maintain. Use versions 3.0 and above as a replacement.
+
+Arm machines ``akita``, ``borzoi``, ``cheetah``, ``connex``, ``mainstone``, ``n800``, ``n810``, ``spitz``, ``terrier``, ``tosa``, ``verdex``, ``z2`` (since 9.0)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+QEMU includes models of some machine types where the QEMU code that
+emulates their SoCs is very old and unmaintained. This code is now
+blocking our ability to move forward with various changes across
+the codebase, and over many years nobody has been interested in
+trying to modernise it. We don't expect any of these machines to have
+a large number of users, because they're all modelling hardware that
+has now passed away into history. We are therefore dropping support
+for all machine types using the PXA2xx and OMAP2 SoCs. We are also
+dropping the ``cheetah`` OMAP1 board, because we don't have any
+test images for it and don't know of anybody who does; the ``sx1``
+and ``sx1-v1`` OMAP1 machines remain supported for now.
Backend options
---------------
@@ -282,6 +283,88 @@ full SCSI support. Use virtio-scsi instead when SCSI passthrough is required.
Note this also applies to ``-device virtio-blk-pci,scsi=on|off``, which is an
alias.
+``-device nvme-ns,eui64-default=on|off`` (since 7.1)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In QEMU versions 6.1, 6.2 and 7.0, the ``nvme-ns`` device generates an EUI-64
+identifier that is not globally unique. If an EUI-64 identifier is required, the
+user must set it explicitly using the ``nvme-ns`` device parameter ``eui64``.
+
+``-device nvme,use-intel-id=on|off`` (since 7.1)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``nvme`` device originally used a PCI Vendor/Device Identifier combination
+from Intel that was not properly allocated. Since version 5.2, the controller
+has used a properly allocated identifier. The ``use-intel-id`` machine
+compatibility parameter is therefore deprecated.
+
+``-device cxl-type3,memdev=xxxx`` (since 8.0)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``cxl-type3`` device initially only used a single memory backend. With
+the addition of volatile memory support, it is now necessary to distinguish
+between persistent and volatile memory backends. As such, ``memdev`` is
+deprecated in favor of ``persistent-memdev``.
+
+``-fsdev proxy`` and ``-virtfs proxy`` (since 8.1)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The 9p ``proxy`` filesystem backend driver has been deprecated and will be
+removed (along with its proxy helper daemon) in a future version of QEMU. Please
+use ``-fsdev local`` or ``-virtfs local`` for using the 9p ``local`` filesystem
+backend, or alternatively consider deploying virtiofsd instead.
+
+The 9p ``proxy`` backend was originally developed as an alternative to the 9p
+``local`` backend. The idea was to enhance security by dispatching actual low
+level filesystem operations from 9p server (QEMU process) over to a separate
+process (the virtfs-proxy-helper binary). However, this alternative never gained
+momentum. The proxy backend is much slower than the local backend, hasn't seen
+any development in years, and proved to be less secure, especially because
+its helper daemon must be run as root, whereas with the local backend
+QEMU is typically run as an unprivileged user and can tighten behaviour by
+mapping permissions and the like with its 'mapped' security model option.
+
+Nowadays it would make sense to reimplement the ``proxy`` backend by using
+QEMU's ``vhost`` feature, which would eliminate the high latency costs under
+which the 9p ``proxy`` backend currently suffers. However, to date nobody
+has indicated plans for such a reimplementation.
+
+RISC-V 'any' CPU type ``-cpu any`` (since 8.2)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The 'any' CPU type was introduced back in 2018 and has been around since the
+initial RISC-V QEMU port. Its usage has always been unclear: users don't know
+what to expect from a CPU called 'any', and in fact the CPU does not do anything
+special that isn't already done by the default CPUs rv32/rv64.
+
+After the introduction of the 'max' CPU type, RISC-V now has a good coverage
+of generic CPUs: rv32 and rv64 as default CPUs and 'max' as a feature complete
+CPU for both 32 and 64 bit builds. Users are therefore discouraged from using
+the 'any' CPU type starting in 8.2.
+
+RISC-V CPU properties which start with capital 'Z' (since 8.2)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All RISC-V CPU properties which start with capital 'Z' are being deprecated
+starting in 8.2. The reason is that they were wrongly added with capital 'Z'
+in the past. CPU properties were later added with lower-case names, which
+is the format we want to use from now on.
+
+Users who try to use these deprecated properties will receive a warning
+recommending that they switch to their stable counterparts:
+
+- "Zifencei" should be replaced with "zifencei"
+- "Zicsr" should be replaced with "zicsr"
+- "Zihintntl" should be replaced with "zihintntl"
+- "Zihintpause" should be replaced with "zihintpause"
+- "Zawrs" should be replaced with "zawrs"
+- "Zfa" should be replaced with "zfa"
+- "Zfh" should be replaced with "zfh"
+- "Zfhmin" should be replaced with "zfhmin"
+- "Zve32f" should be replaced with "zve32f"
+- "Zve64f" should be replaced with "zve64f"
+- "Zve64d" should be replaced with "zve64d"
+
Block device options
''''''''''''''''''''
@@ -307,25 +390,33 @@ The above, converted to the current supported format::
json:{"file.driver":"rbd", "file.pool":"rbd", "file.image":"name"}
-linux-user mode CPUs
---------------------
+``iscsi,password=xxx`` (since 8.0)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-``ppc64abi32`` CPUs (since 5.2)
-'''''''''''''''''''''''''''''''
+Specifying the iSCSI password in plain text on the command line using the
+``password`` option is insecure. The ``password-secret`` option should be
+used instead, to refer to a ``--object secret...`` instance that provides
+a password via a file, or encrypted.
-The ``ppc64abi32`` architecture has a number of issues which regularly
-trip up our CI testing and is suspected to be quite broken. For that
-reason the maintainers strongly suspect no one actually uses it.
+Character device options
+''''''''''''''''''''''''
-MIPS ``I7200`` CPU (since 5.2)
-''''''''''''''''''''''''''''''
+Backend ``memory`` (since 9.0)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-The ``I7200`` guest CPU relies on the nanoMIPS ISA, which is deprecated
-(the ISA has never been upstreamed to a compiler toolchain). Therefore
-this CPU is also deprecated.
+``memory`` is a deprecated synonym for ``ringbuf``.
+
+CPU device properties
+'''''''''''''''''''''
+
+``pmu-num=n`` on RISC-V CPUs (since 8.2)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+In order to support more flexible counter configurations this has been replaced
+by a ``pmu-mask`` property. If the set of counters is contiguous, the mask can
+be calculated as ``((2 ^ n) - 1) << 3``, where ``^`` denotes exponentiation.
+The least significant three bits must be left clear.
-Related binaries
-----------------
Backwards compatibility
-----------------------
@@ -356,11 +447,65 @@ versions, aliases will point to newer CPU model versions
depending on the machine type, so management software must
resolve CPU model aliases before starting a virtual machine.
-Guest Emulator ISAs
--------------------
+QEMU guest agent
+----------------
+
+``--blacklist`` command line option (since 7.2)
+'''''''''''''''''''''''''''''''''''''''''''''''
+
+``--blacklist`` has been replaced by ``--block-rpcs`` (which is a better
+wording for what this option does). The short form ``-b`` still stays
+the same and thus is the preferred way for scripts that should run with
+both older and future versions of QEMU.
+
+``blacklist`` config file option (since 7.2)
+''''''''''''''''''''''''''''''''''''''''''''
+
+The ``blacklist`` config file option has been renamed to ``block-rpcs``
+(to be in sync with the renaming of the corresponding command line
+option).
+
+Migration
+---------
+
+``skipped`` MigrationStats field (since 8.1)
+''''''''''''''''''''''''''''''''''''''''''''
+
+The ``skipped`` field in ``MigrationStats`` has been deprecated. It hasn't
+been used for more than 10 years.
+
+``inc`` migrate command option (since 8.2)
+''''''''''''''''''''''''''''''''''''''''''
+
+Use blockdev-mirror with NBD instead.
+
+As an intermediate step the ``inc`` functionality can be achieved by
+setting the ``block-incremental`` migration parameter to ``true``.
+But this parameter is also deprecated.
+
+``blk`` migrate command option (since 8.2)
+''''''''''''''''''''''''''''''''''''''''''
+
+Use blockdev-mirror with NBD instead.
+
+As an intermediate step the ``blk`` functionality can be achieved by
+setting the ``block`` migration capability to ``true``. But this
+capability is also deprecated.
+
+block migration (since 8.2)
+'''''''''''''''''''''''''''
+
+Block migration is too inflexible. It needs to migrate all block
+devices or none.
+
+Please see "QMP invocation for live storage migration with
+``blockdev-mirror`` + NBD" in docs/interop/live-block-operations.rst
+for a detailed explanation.
-nanoMIPS ISA
-''''''''''''
+old compression method (since 8.2)
+''''''''''''''''''''''''''''''''''
-The ``nanoMIPS`` ISA has never been upstreamed to any compiler toolchain.
-As it is hard to generate binaries for it, declare it deprecated.
+The old compression method fails too often because of races. It will be
+removed if nobody fixes it. For starters, the migration-test
+compression tests are disabled because they fail randomly. If you need
+compression, use multifd compression methods.
diff --git a/docs/about/emulation.rst b/docs/about/emulation.rst
new file mode 100644
index 0000000000..b5ff9c5f69
--- /dev/null
+++ b/docs/about/emulation.rst
@@ -0,0 +1,184 @@
+Emulation
+=========
+
+QEMU's Tiny Code Generator (TCG) provides the ability to emulate a
+number of CPU architectures on any supported host platform. Both
+:ref:`System Emulation` and :ref:`User Mode Emulation` are supported
+depending on the guest architecture.
+
+.. list-table:: Supported Guest Architectures for Emulation
+ :widths: 30 10 10 50
+ :header-rows: 1
+
+ * - Architecture (qemu name)
+ - System
+ - User
+ - Notes
+ * - Alpha
+ - Yes
+ - Yes
+ - Legacy 64 bit RISC ISA developed by DEC
+ * - Arm (arm, aarch64)
+ - :ref:`Yes<ARM-System-emulator>`
+ - Yes
+ - Wide range of features, see :ref:`Arm Emulation` for details
+ * - AVR
+ - :ref:`Yes<AVR-System-emulator>`
+ - No
+ - 8 bit micro controller, often used in maker projects
+ * - Cris
+ - Yes
+ - Yes
+ - Embedded RISC chip developed by AXIS
+ * - Hexagon
+ - No
+ - Yes
+ - Family of DSPs by Qualcomm
+ * - PA-RISC (hppa)
+ - Yes
+ - Yes
+ - A legacy RISC system used in HP's old minicomputers
+ * - x86 (i386, x86_64)
+ - :ref:`Yes<QEMU-PC-System-emulator>`
+ - Yes
+ - The ubiquitous desktop PC CPU architecture, 32 and 64 bit.
+ * - LoongArch
+ - Yes
+ - Yes
+ - A MIPS-like 64 bit RISC architecture developed in China
+ * - m68k
+ - :ref:`Yes<ColdFire-System-emulator>`
+ - Yes
+ - Motorola 68000 variants and ColdFire
+ * - Microblaze
+ - Yes
+ - Yes
+ - RISC based soft-core by Xilinx
+ * - MIPS (mips*)
+ - :ref:`Yes<MIPS-System-emulator>`
+ - Yes
+ - Venerable RISC architecture originally out of Stanford University
+ * - OpenRISC
+ - :ref:`Yes<OpenRISC-System-emulator>`
+ - Yes
+ - Open source RISC architecture developed by the OpenRISC community
+ * - Power (ppc, ppc64)
+ - :ref:`Yes<PowerPC-System-emulator>`
+ - Yes
+ - A general purpose RISC architecture now managed by IBM
+ * - RISC-V
+ - :ref:`Yes<RISC-V-System-emulator>`
+ - Yes
+ - An open standard RISC ISA maintained by RISC-V International
+ * - RX
+ - :ref:`Yes<RX-System-emulator>`
+ - No
+ - A 32 bit micro controller developed by Renesas
+ * - s390x
+ - :ref:`Yes<s390x-System-emulator>`
+ - Yes
+ - A 64 bit CPU found in IBM's System Z mainframes
+ * - sh4
+ - Yes
+ - Yes
+ - A 32 bit RISC embedded CPU developed by Hitachi
+ * - SPARC (sparc, sparc64)
+ - :ref:`Yes<Sparc32-System-emulator>`
+ - Yes
+ - A RISC ISA originally developed by Sun Microsystems
+ * - Tricore
+ - Yes
+ - No
+ - A 32 bit RISC/uController/DSP developed by Infineon
+ * - Xtensa
+ - :ref:`Yes<Xtensa-System-emulator>`
+ - Yes
+ - A configurable 32 bit soft core now owned by Cadence
+
+A number of features are only available when running under
+emulation including :ref:`Record/Replay<replay>` and :ref:`TCG Plugins`.
+
+.. _Semihosting:
+
+Semihosting
+-----------
+
+Semihosting is a feature defined by the owner of the architecture to
+allow programs to interact with a debugging host system. On real
+hardware this is usually provided by an In-circuit emulator (ICE)
+hooked directly to the board. QEMU's implementation allows for
+semihosting calls to be passed to the host system or via the
+``gdbstub``.
+
+Generally semihosting makes it easier to bring up low level code before a
+more fully functional operating system has been enabled. On QEMU it
+also allows for embedded micro-controller code which typically doesn't
+have a full libc to be run as "bare-metal" code under QEMU's user-mode
+emulation. It is also useful for writing test cases and indeed a
+number of compiler suites as well as QEMU itself use semihosting calls
+to exit test code while reporting the success state.
+
+Semihosting is only available using TCG emulation. This is because the
+instructions to trigger a semihosting call are typically reserved,
+causing most hypervisors to trap and fault on them.
+
+.. warning::
+ Semihosting inherently bypasses any isolation there may be between
+ the guest and the host. As a result a program using semihosting can
+ happily trash your host system. Some semihosting calls (e.g.
+ ``SYS_READC``) can block execution indefinitely. You should only
+ ever run trusted code with semihosting enabled.
+
+Redirection
+~~~~~~~~~~~
+
+Semihosting calls can be re-directed to a (potentially remote) gdb
+during debugging via the :ref:`gdbstub<GDB usage>`. Output to the
+semihosting console is configured as a ``chardev`` so can be
+redirected to a file, pipe or socket like any other ``chardev``
+device.
+
+Supported Targets
+~~~~~~~~~~~~~~~~~
+
+Most targets offer similar semihosting implementations with some
+minor changes to define the appropriate instruction to encode the
+semihosting call and which registers hold the parameters. They tend to
+present a simple POSIX-like API which allows your program to read and
+write files, access the console and some other basic interactions.
+
+For full details of the ABI for a particular target, and the set of
+calls it provides, you should consult the semihosting specification
+for that architecture.
+
+.. note::
+ QEMU makes an implementation decision to implement all file
+ access in ``O_BINARY`` mode. The user-visible effect of this is that,
+ regardless of the text/binary mode the program sets, QEMU will
+ always select a binary mode, ensuring no line-terminator conversion
+ is performed on input or output. This is because gdb semihosting
+ support doesn't make the distinction between the modes and
+ magically processing line endings can be confusing.
+
+.. list-table:: Guest Architectures supporting Semihosting
+ :widths: 10 10 80
+ :header-rows: 1
+
+ * - Architecture
+ - Modes
+ - Specification
+ * - Arm
+ - System and User-mode
+ - https://github.com/ARM-software/abi-aa/blob/main/semihosting/semihosting.rst
+ * - m68k
+ - System
+ - https://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=libgloss/m68k/m68k-semi.txt;hb=HEAD
+ * - MIPS
+ - System
+ - Unified Hosting Interface (MD01069)
+ * - RISC-V
+ - System and User-mode
+ - https://github.com/riscv/riscv-semihosting-spec/blob/main/riscv-semihosting-spec.adoc
+ * - Xtensa
+ - System
+ - Tensilica ISS SIMCALL
diff --git a/docs/about/index.rst b/docs/about/index.rst
index 5bea653c07..4f96ab5d91 100644
--- a/docs/about/index.rst
+++ b/docs/about/index.rst
@@ -5,24 +5,25 @@ About QEMU
QEMU is a generic and open source machine emulator and virtualizer.
QEMU can be used in several different ways. The most common is for
-"system emulation", where it provides a virtual model of an
+:ref:`System Emulation`, where it provides a virtual model of an
entire machine (CPU, memory and emulated devices) to run a guest OS.
-In this mode the CPU may be fully emulated, or it may work with
-a hypervisor such as KVM, Xen, Hax or Hypervisor.Framework to
-allow the guest to run directly on the host CPU.
+In this mode the CPU may be fully emulated, or it may work with a
+hypervisor such as KVM, Xen or Hypervisor.Framework to allow the
+guest to run directly on the host CPU.
-The second supported way to use QEMU is "user mode emulation",
+The second supported way to use QEMU is :ref:`User Mode Emulation`,
where QEMU can launch processes compiled for one CPU on another CPU.
In this mode the CPU is always emulated.
-QEMU also provides a number of standalone commandline utilities,
-such as the ``qemu-img`` disk image utility that allows you to create,
-convert and modify disk images.
+QEMU also provides a number of standalone :ref:`command line
+utilities<Tools>`, such as the ``qemu-img`` disk image utility that
+allows you to create, convert and modify disk images.
.. toctree::
:maxdepth: 2
build-platforms
+ emulation
deprecated
removed-features
license
diff --git a/docs/about/license.rst b/docs/about/license.rst
index cde3d2d25d..303c55d61b 100644
--- a/docs/about/license.rst
+++ b/docs/about/license.rst
@@ -8,4 +8,4 @@ QEMU is a trademark of Fabrice Bellard.
QEMU is released under the `GNU General Public
License <https://www.gnu.org/licenses/gpl-2.0.txt>`__, version 2. Parts
of QEMU have specific licenses, see file
-`LICENSE <https://git.qemu.org/?p=qemu.git;a=blob_plain;f=LICENSE>`__.
+`LICENSE <https://gitlab.com/qemu-project/qemu/-/raw/master/LICENSE>`__.
diff --git a/docs/about/removed-features.rst b/docs/about/removed-features.rst
index 9d0d90c90d..53ca08aba9 100644
--- a/docs/about/removed-features.rst
+++ b/docs/about/removed-features.rst
@@ -330,6 +330,192 @@ RISC-V firmware not booted by default (removed in 5.1)
QEMU 5.1 changes the default behaviour from ``-bios none`` to ``-bios default``
for the RISC-V ``virt`` machine and ``sifive_u`` machine.
+``-no-quit`` (removed in 7.0)
+'''''''''''''''''''''''''''''
+
+The ``-no-quit`` option was a synonym for ``-display ...,window-close=off``, which
+should be used instead.
+
+``--enable-fips`` (removed in 7.1)
+''''''''''''''''''''''''''''''''''
+
+This option restricted usage of certain cryptographic algorithms when
+the host is operating in FIPS mode.
+
+If FIPS compliance is required, QEMU should be built with the ``libgcrypt``
+or ``gnutls`` library enabled as a cryptography provider.
+
+Neither the ``nettle`` library nor the built-in cryptography provider is
+supported on FIPS enabled hosts.
+
+``-writeconfig`` (removed in 7.1)
+'''''''''''''''''''''''''''''''''
+
+The ``-writeconfig`` option was not able to serialize the entire contents
+of the QEMU command line. It is thus considered a failed experiment
+and removed without a replacement.
+
+``loaded`` property of ``secret`` and ``secret_keyring`` objects (removed in 7.1)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``loaded=on`` option in the command line or QMP ``object-add`` either had
+no effect (if ``loaded`` was the last option) or caused options to be
+effectively ignored as if they were not given. The property was therefore
+useless and has been removed.
+
+``opened`` property of ``rng-*`` objects (removed in 7.1)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``opened=on`` option in the command line or QMP ``object-add`` either had
+no effect (if ``opened`` was the last option) or caused errors. The property
+was therefore useless and has been removed.
+
+``-display sdl,window_close=...`` (removed in 7.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use ``-display sdl,window-close=...`` instead (i.e. with a minus instead of
+an underscore between "window" and "close").
+
+``-alt-grab`` and ``-display sdl,alt_grab=on`` (removed in 7.1)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use ``-display sdl,grab-mod=lshift-lctrl-lalt`` instead.
+
+``-ctrl-grab`` and ``-display sdl,ctrl_grab=on`` (removed in 7.1)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use ``-display sdl,grab-mod=rctrl`` instead.
+
+``-sdl`` (removed in 7.1)
+'''''''''''''''''''''''''
+
+Use ``-display sdl`` instead.
+
+``-curses`` (removed in 7.1)
+''''''''''''''''''''''''''''
+
+Use ``-display curses`` instead.
+
+Creating sound card devices using ``-soundhw`` (removed in 7.1)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Sound card devices should be created using ``-device`` or ``-audio``.
+The exception is ``pcspk`` which can be activated using ``-machine
+pcspk-audiodev=<name>``.
+
+``-watchdog`` (removed in 7.2)
+''''''''''''''''''''''''''''''
+
+Use ``-device`` instead.
+
+Hexadecimal sizes with scaling multipliers (removed in 8.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Input parameters that take a size value should only use a size suffix
+(such as 'k' or 'M') when the base is written in decimal, and not when
+the value is hexadecimal. That is, '0x20M' should be written either as
+'32M' or as '0x2000000'.
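
The conversion described above can be sketched as a small helper for updating old command lines. This is only an illustration: ``normalize_size`` and the suffix table are hypothetical names, the suffix set is an assumption rather than QEMU's authoritative list, and the multipliers are taken to be powers of 1024, consistent with '0x20M' being equal to '0x2000000'.

```python
import re

# Binary multipliers for size suffixes. The set of suffixes is an assumption
# made for this sketch, not QEMU's authoritative list.
_SUFFIXES = {"k": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}

# A hexadecimal number immediately followed by one of the suffixes above.
_HEX_WITH_SUFFIX = re.compile(r"^(0[xX][0-9a-fA-F]+)([kMGT])$")


def normalize_size(value: str) -> str:
    """Rewrite a hexadecimal size with a suffix as plain hexadecimal.

    Values that do not match (decimal sizes, or hexadecimal sizes without
    a suffix) are returned unchanged.
    """
    match = _HEX_WITH_SUFFIX.match(value)
    if match is None:
        return value
    number, suffix = match.groups()
    return hex(int(number, 16) * _SUFFIXES[suffix])


print(normalize_size("0x20M"))  # 0x2000000
```

Decimal values such as '32M' pass through untouched, so the helper can be applied to every size argument without first checking its base.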
+
+``-chardev`` backend aliases ``tty`` and ``parport`` (removed in 8.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+``tty`` and ``parport`` used to be aliases for ``serial`` and ``parallel``
+respectively. The actual backend names should be used instead.
+
+``-drive if=none`` for the sifive_u OTP device (removed in 8.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Use ``-drive if=pflash`` to configure the OTP device of the sifive_u
+RISC-V machine instead.
+
+``-spice password=string`` (removed in 8.0)
+'''''''''''''''''''''''''''''''''''''''''''
+
+This option was insecure because the SPICE password remained visible in
+the process listing. This was replaced by the new ``password-secret``
+option which lets the password be securely provided on the command
+line using a ``secret`` object instance.
+
+``QEMU_AUDIO_`` environment variables and ``-audio-help`` (removed in 8.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The ``-audiodev`` and ``-audio`` command line options are now the only
+way to specify audio backend settings.
+
+Using ``-audiodev`` to define the default audio backend (removed in 8.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+If no audiodev property is specified, previous versions would use the
+first ``-audiodev`` command line option as a fallback. Starting with
+version 8.2, audio backends created with ``-audiodev`` will only be
+used by clients (sound cards, machines with embedded sound hardware, VNC)
+that refer to it in an ``audiodev=`` property.
+
+In order to configure a default audio backend, use the ``-audio``
+command line option without specifying a ``model``; while previous
+versions of QEMU required a model, starting with version 8.2
+QEMU does not require a model and will not create any sound card
+in this case.
+
+Note that the default audio backend must be configured on the command
+line if the ``-nodefaults`` option is used.
+
+``-no-hpet`` (removed in 9.0)
+'''''''''''''''''''''''''''''
+
+The HPET setting has been turned into a machine property.
+Use ``-machine hpet=off`` instead.
+
+``-no-acpi`` (removed in 9.0)
+'''''''''''''''''''''''''''''
+
+The ``-no-acpi`` setting has been turned into a machine property.
+Use ``-machine acpi=off`` instead.
+
+``-async-teardown`` (removed in 9.0)
+''''''''''''''''''''''''''''''''''''
+
+Use ``-run-with async-teardown=on`` instead.
+
+``-chroot`` (removed in 9.0)
+''''''''''''''''''''''''''''
+
+Use ``-run-with chroot=dir`` instead.
+
+``-singlestep`` (removed in 9.0)
+''''''''''''''''''''''''''''''''
+
+The ``-singlestep`` option has been turned into an accelerator property,
+and given a name that better reflects what it actually does.
+Use ``-accel tcg,one-insn-per-tb=on`` instead.
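
For scripts that still pass some of the removed standalone flags, the one-to-one replacements listed above can be applied mechanically. The following is a minimal sketch, not a QEMU tool: ``REPLACEMENTS`` and ``rewrite_argv`` are hypothetical names, only flags that take no argument are covered, and no attempt is made to merge the result with an existing ``-machine``, ``-accel`` or ``-display`` option.

```python
# Replacements copied from the entries above. Flags that take an argument
# (for example -chroot) or that need merging into an existing option
# (for example -no-quit) are deliberately left out of this sketch.
REPLACEMENTS = {
    "-sdl": ["-display", "sdl"],
    "-curses": ["-display", "curses"],
    "-no-hpet": ["-machine", "hpet=off"],
    "-no-acpi": ["-machine", "acpi=off"],
    "-async-teardown": ["-run-with", "async-teardown=on"],
    "-singlestep": ["-accel", "tcg,one-insn-per-tb=on"],
}


def rewrite_argv(argv):
    """Return a copy of argv with the removed flags replaced.

    The matching is purely textual: an option *value* that happens to be
    spelled like one of the flags would be rewritten too.
    """
    rewritten = []
    for arg in argv:
        rewritten.extend(REPLACEMENTS.get(arg, [arg]))
    return rewritten


print(rewrite_argv(["qemu-system-x86_64", "-no-hpet", "-m", "2G"]))
```

Because the result may now contain a second ``-machine`` or ``-accel`` option, the rewritten command line should be reviewed by hand before use.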
+
+``-smp`` ("parameter=0" SMP configurations) (removed in 9.0)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Specified CPU topology parameters must be greater than zero.
+
+In the SMP configuration, users should either provide a CPU topology
+parameter with a reasonable value (greater than zero) or just omit it
+and QEMU will compute the missing value.
+
+However, historically users were implicitly allowed to provide a parameter
+with a value of zero, which is meaningless and could cause unexpected
+results in the ``-smp`` parsing. Support for this kind of configuration
+(e.g. ``-smp 8,sockets=0``) was therefore removed in 9.0; users must
+ensure that all the topology members described with ``-smp`` are greater
+than zero.
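
Existing ``-smp`` arguments can be screened for the rule above before upgrading. The sketch below is an assumption-laden illustration rather than QEMU's parser: ``zero_smp_parameters`` is a hypothetical helper, and it treats a bare number without ``=`` as the ``cpus`` parameter.

```python
from typing import List


def zero_smp_parameters(smp: str) -> List[str]:
    """Return the names of -smp parameters that are explicitly set to zero.

    `smp` is the argument given to -smp, for example "8,sockets=0".
    A field without "=" is treated as the "cpus" parameter.
    """
    zeros = []
    for field in smp.split(","):
        name, sep, value = field.partition("=")
        if not sep:
            # Bare number: the short form of cpus=<n>.
            name, value = "cpus", field
        value = value.strip()
        if value.isdigit() and int(value) == 0:
            zeros.append(name.strip())
    return zeros


print(zero_smp_parameters("8,sockets=0"))  # ['sockets']
```

An empty list means no parameter is set to zero; omitted parameters are not reported, since QEMU computes those itself.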
+
+User-mode emulator command line arguments
+-----------------------------------------
+
+``-singlestep`` (removed in 9.0)
+''''''''''''''''''''''''''''''''
+
+The ``-singlestep`` option has been given a name that better reflects
+what it actually does. For both linux-user and bsd-user, use the
+``-one-insn-per-tb`` option instead.
+
+
QEMU Machine Protocol (QMP) commands
------------------------------------
@@ -348,7 +534,8 @@ documentation of ``query-hotpluggable-cpus`` for additional details.
``change`` (removed in 6.0)
'''''''''''''''''''''''''''
-Use ``blockdev-change-medium`` or ``change-vnc-password`` instead.
+Use ``blockdev-change-medium``, ``change-vnc-password`` or
+``display-update`` instead.
``query-events`` (removed in 6.0)
'''''''''''''''''''''''''''''''''
@@ -414,6 +601,19 @@ type of array items in query-named-block-nodes.
Specify the properties for the object as top-level arguments instead.
+``query-sgx`` return value member ``section-size`` (removed in 8.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Member ``section-size`` in the return value of ``query-sgx``
+was superseded by ``sections``.
+
+
+``query-sgx-capabilities`` return value member ``section-size`` (removed in 8.0)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Member ``section-size`` in the return value of ``query-sgx-capabilities``
+was superseded by ``sections``.
+
Human Monitor Protocol (HMP) commands
-------------------------------------
@@ -468,6 +668,27 @@ Use ``migrate-set-parameters`` instead.
This command didn't produce any output already. Removed with no replacement.
+``singlestep`` (removed in 9.0)
+'''''''''''''''''''''''''''''''
+
+The ``singlestep`` command has been replaced by the ``one-insn-per-tb``
+command, which has the same behaviour but a less misleading name.
+
+Host Architectures
+------------------
+
+System emulation on 32-bit Windows hosts (removed in 9.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+Windows 11 has no support for 32-bit host installs, and Windows 10 did
+not support new 32-bit installs, only upgrades. 32-bit Windows support
+has now been dropped by the MSYS2 project. QEMU also is deprecating
+and dropping support for 32-bit x86 host deployments in
+general. 32-bit Windows is therefore no longer a supported host for
+QEMU. Since all x86 hardware from the past 10 years or more is
+capable of the 64-bit x86 extensions, a corresponding 64-bit OS should
+be used instead.
+
Guest Emulator ISAs
-------------------
@@ -485,9 +706,8 @@ KVM guest support on 32-bit Arm hosts (removed in 5.2)
''''''''''''''''''''''''''''''''''''''''''''''''''''''
The Linux kernel has dropped support for allowing 32-bit Arm systems
-to host KVM guests as of the 5.7 kernel. Accordingly, QEMU is deprecating
-its support for this configuration and will remove it in a future version.
-Running 32-bit guests on a 64-bit Arm host remains supported.
+to host KVM guests as of the 5.7 kernel; this support was therefore removed
+from QEMU as well. Running 32-bit guests on a 64-bit Arm host remains supported.
RISC-V ISA Specific CPUs (removed in 5.1)
'''''''''''''''''''''''''''''''''''''''''
@@ -531,6 +751,40 @@ Support for this CPU was removed from the upstream Linux kernel, and
there is no available upstream toolchain to build binaries for it.
Removed without replacement.
+x86 ``Icelake-Client`` CPU (removed in 7.1)
+'''''''''''''''''''''''''''''''''''''''''''
+
+There never was an Icelake Client CPU; this model was erroneous and imaginary.
+Use ``Icelake-Server`` instead.
+
+Nios II CPU (removed in 9.1)
+''''''''''''''''''''''''''''
+
+The QEMU Nios II architecture was orphaned; Intel has EOL'ed the Nios II
+processor IP (see `Intel discontinuance notification`_).
+
+System accelerators
+-------------------
+
+Userspace local APIC with KVM (x86, removed in 8.0)
+'''''''''''''''''''''''''''''''''''''''''''''''''''
+
+``-M kernel-irqchip=off`` cannot be used on KVM if the CPU model includes
+a local APIC. The ``split`` setting is supported, as is using ``-M
+kernel-irqchip=off`` when the CPU does not have a local APIC.
+
+HAXM (``-accel hax``) (removed in 8.2)
+''''''''''''''''''''''''''''''''''''''
+
+The HAXM project has been retired (see https://github.com/intel/haxm#status).
+Use "whpx" (on Windows) or "hvf" (on macOS) instead.
+
+MIPS "Trap-and-Emulate" KVM support (removed in 8.0)
+''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The MIPS "Trap-and-Emulate" KVM host and guest support was removed
+from Linux in 2021, and is not supported anymore by QEMU either.
+
System emulator machines
------------------------
@@ -568,8 +822,8 @@ mips ``fulong2e`` machine alias (removed in 6.0)
This machine has been renamed ``fuloong2e``.
-``pc-0.10`` up to ``pc-1.3`` (removed in 4.0 up to 6.0)
-'''''''''''''''''''''''''''''''''''''''''''''''''''''''
+``pc-0.10`` up to ``pc-i440fx-1.7`` (removed in 4.0 up to 8.2)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
These machine types were very old and likely could not be used for live
migration from old QEMU versions anymore. Use a newer machine type instead.
@@ -581,6 +835,22 @@ The Raspberry Pi machines come in various models (A, A+, B, B+). To be able
to distinguish which model QEMU is implementing, the ``raspi2`` and ``raspi3``
machines have been renamed ``raspi2b`` and ``raspi3b``.
+Aspeed ``swift-bmc`` machine (removed in 7.0)
+'''''''''''''''''''''''''''''''''''''''''''''
+
+This machine was removed because it was unused. Alternative AST2500 based
+OpenPOWER machines are ``witherspoon-bmc`` and ``romulus-bmc``.
+
+ppc ``taihu`` machine (removed in 7.2)
+''''''''''''''''''''''''''''''''''''''
+
+This machine was removed because it was only partially emulated and the 405
+machines are very similar. Use the ``ref405ep`` machine instead.
+
+Nios II ``10m50-ghrd`` and ``nios2-generic-nommu`` machines (removed in 9.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The Nios II architecture was orphaned.
linux-user mode CPUs
--------------------
@@ -594,6 +864,32 @@ the upstream Linux kernel in 2018, and it has also been dropped from glibc, so
there is no new Linux development taking place with this architecture. For
running the old binaries, you can use older versions of QEMU.
+``ppc64abi32`` CPUs (removed in 7.0)
+''''''''''''''''''''''''''''''''''''
+
+The ``ppc64abi32`` architecture had a number of issues which regularly
+tripped up the CI testing, and it was suspected to be quite broken. For that
+reason the maintainers strongly suspected no one actually used it.
+
+``nios2`` CPU (removed in 9.1)
+''''''''''''''''''''''''''''''
+
+The QEMU Nios II architecture was orphaned; Intel has EOL'ed the Nios II
+processor IP (see `Intel discontinuance notification`_).
+
+TCG introspection features
+--------------------------
+
+TCG trace-events (removed in 6.2)
+'''''''''''''''''''''''''''''''''
+
+The ability to add new TCG trace points had bit rotted and, as the
+feature can be replicated with TCG plugins, it was removed. Any user
+who is currently using this feature and needs help with converting
+to TCG plugins should contact the qemu-devel
+mailing list.
+
+
System emulator devices
-----------------------
@@ -620,6 +916,20 @@ The 'ide-drive' device has been removed. Users should use 'ide-hd' or
The 'scsi-disk' device has been removed. Users should use 'scsi-hd' or
'scsi-cd' as appropriate to get a SCSI hard disk or CD-ROM as needed.
+``sga`` (removed in 8.0)
+''''''''''''''''''''''''
+
+The ``sga`` device loaded an option ROM for x86 targets which enabled
+SeaBIOS to send messages to the serial console. SeaBIOS 1.11.0 onwards
+contains native support for this feature and thus use of the option
+ROM approach was obsolete. The native SeaBIOS support can be activated
+by using ``-machine graphics=off``.
+
+``pvrdma`` and the RDMA subsystem (removed in 9.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''
+
+The 'pvrdma' device and the whole RDMA subsystem have been removed.
+
Related binaries
----------------
@@ -658,8 +968,8 @@ enforce that any failure to open the backing image (including if the
backing file is missing or an incorrect format was specified) is an
error when ``-u`` is not used.
-qemu-img amend to adjust backing file (removed in 6.1)
-''''''''''''''''''''''''''''''''''''''''''''''''''''''
+``qemu-img amend`` to adjust backing file (removed in 6.1)
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
The use of ``qemu-img amend`` to modify the name or format of a qcow2
backing image was never fully documented or tested, and interferes
@@ -670,8 +980,8 @@ backing chain should be performed with ``qemu-img rebase -u`` either
before or after the remaining changes being performed by amend, as
appropriate.
-qemu-img backing file without format (removed in 6.1)
-'''''''''''''''''''''''''''''''''''''''''''''''''''''
+``qemu-img`` backing file without format (removed in 6.1)
+'''''''''''''''''''''''''''''''''''''''''''''''''''''''''
The use of ``qemu-img create``, ``qemu-img rebase``, or ``qemu-img
convert`` to create or modify an image that depends on a backing file
@@ -703,3 +1013,17 @@ The VXHS code did not compile since v2.12.0. It was removed in 5.1.
The corresponding upstream server project is no longer maintained.
Users are recommended to switch to an alternative distributed block
device driver such as RBD.
+
+Tools
+-----
+
+virtiofsd (removed in 8.0)
+''''''''''''''''''''''''''
+
+There is a newer Rust implementation of ``virtiofsd`` at
+``https://gitlab.com/virtio-fs/virtiofsd``; this has been
+stable for some time and is now widely used.
+The command line and feature set are very close to those of the removed
+C implementation.
+
+.. _Intel discontinuance notification: https://www.intel.com/content/www/us/en/content-details/781327/intel-is-discontinuing-ip-ordering-codes-listed-in-pdn2312-for-nios-ii-ip.html
diff --git a/docs/amd-memory-encryption.txt b/docs/amd-memory-encryption.txt
deleted file mode 100644
index ffca382b5f..0000000000
--- a/docs/amd-memory-encryption.txt
+++ /dev/null
@@ -1,148 +0,0 @@
-Secure Encrypted Virtualization (SEV) is a feature found on AMD processors.
-
-SEV is an extension to the AMD-V architecture which supports running encrypted
-virtual machines (VMs) under the control of KVM. Encrypted VMs have their pages
-(code and data) secured such that only the guest itself has access to the
-unencrypted version. Each encrypted VM is associated with a unique encryption
-key; if its data is accessed by a different entity using a different key the
-encrypted guests data will be incorrectly decrypted, leading to unintelligible
-data.
-
-Key management for this feature is handled by a separate processor known as the
-AMD secure processor (AMD-SP), which is present in AMD SOCs. Firmware running
-inside the AMD-SP provides commands to support a common VM lifecycle. This
-includes commands for launching, snapshotting, migrating and debugging the
-encrypted guest. These SEV commands can be issued via KVM_MEMORY_ENCRYPT_OP
-ioctls.
-
-Secure Encrypted Virtualization - Encrypted State (SEV-ES) builds on the SEV
-support to additionally protect the guest register state. In order to allow a
-hypervisor to perform functions on behalf of a guest, there is architectural
-support for notifying a guest's operating system when certain types of VMEXITs
-are about to occur. This allows the guest to selectively share information with
-the hypervisor to satisfy the requested function.
-
-Launching
----------
-Boot images (such as bios) must be encrypted before a guest can be booted. The
-MEMORY_ENCRYPT_OP ioctl provides commands to encrypt the images: LAUNCH_START,
-LAUNCH_UPDATE_DATA, LAUNCH_MEASURE and LAUNCH_FINISH. These four commands
-together generate a fresh memory encryption key for the VM, encrypt the boot
-images and provide a measurement than can be used as an attestation of a
-successful launch.
-
-For a SEV-ES guest, the LAUNCH_UPDATE_VMSA command is also used to encrypt the
-guest register state, or VM save area (VMSA), for all of the guest vCPUs.
-
-LAUNCH_START is called first to create a cryptographic launch context within
-the firmware. To create this context, guest owner must provide a guest policy,
-its public Diffie-Hellman key (PDH) and session parameters. These inputs
-should be treated as a binary blob and must be passed as-is to the SEV firmware.
-
-The guest policy is passed as plaintext. A hypervisor may choose to read it,
-but should not modify it (any modification of the policy bits will result
-in bad measurement). The guest policy is a 4-byte data structure containing
-several flags that restricts what can be done on a running SEV guest.
-See KM Spec section 3 and 6.2 for more details.
-
-The guest policy can be provided via the 'policy' property (see below)
-
-# ${QEMU} \
- sev-guest,id=sev0,policy=0x1...\
-
-Setting the "SEV-ES required" policy bit (bit 2) will launch the guest as a
-SEV-ES guest (see below)
-
-# ${QEMU} \
- sev-guest,id=sev0,policy=0x5...\
-
-The guest owner provided DH certificate and session parameters will be used to
-establish a cryptographic session with the guest owner to negotiate keys used
-for the attestation.
-
-The DH certificate and session blob can be provided via the 'dh-cert-file' and
-'session-file' properties (see below)
-
-# ${QEMU} \
- sev-guest,id=sev0,dh-cert-file=<file1>,session-file=<file2>
-
-LAUNCH_UPDATE_DATA encrypts the memory region using the cryptographic context
-created via the LAUNCH_START command. If required, this command can be called
-multiple times to encrypt different memory regions. The command also calculates
-the measurement of the memory contents as it encrypts.
-
-LAUNCH_UPDATE_VMSA encrypts all the vCPU VMSAs for a SEV-ES guest using the
-cryptographic context created via the LAUNCH_START command. The command also
-calculates the measurement of the VMSAs as it encrypts them.
-
-LAUNCH_MEASURE can be used to retrieve the measurement of encrypted memory and,
-for a SEV-ES guest, encrypted VMSAs. This measurement is a signature of the
-memory contents and, for a SEV-ES guest, the VMSA contents, that can be sent
-to the guest owner as an attestation that the memory and VMSAs were encrypted
-correctly by the firmware. The guest owner may wait to provide the guest
-confidential information until it can verify the attestation measurement.
-Since the guest owner knows the initial contents of the guest at boot, the
-attestation measurement can be verified by comparing it to what the guest owner
-expects.
-
-LAUNCH_FINISH finalizes the guest launch and destroys the cryptographic
-context.
-
-See SEV KM API Spec [1] 'Launching a guest' usage flow (Appendix A) for the
-complete flow chart.
-
-To launch a SEV guest
-
-# ${QEMU} \
- -machine ...,confidential-guest-support=sev0 \
- -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1
-
-To launch a SEV-ES guest
-
-# ${QEMU} \
- -machine ...,confidential-guest-support=sev0 \
- -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1,policy=0x5
-
-An SEV-ES guest has some restrictions as compared to a SEV guest. Because the
-guest register state is encrypted and cannot be updated by the VMM/hypervisor,
-a SEV-ES guest:
- - Does not support SMM - SMM support requires updating the guest register
- state.
- - Does not support reboot - a system reset requires updating the guest register
- state.
- - Requires in-kernel irqchip - the burden is placed on the hypervisor to
- manage booting APs.
-
-Debugging
------------
-Since the memory contents of a SEV guest are encrypted, hypervisor access to
-the guest memory will return cipher text. If the guest policy allows debugging,
-then a hypervisor can use the DEBUG_DECRYPT and DEBUG_ENCRYPT commands to access
-the guest memory region for debug purposes. This is not supported in QEMU yet.
-
-Snapshot/Restore
------------------
-TODO
-
-Live Migration
-----------------
-TODO
-
-References
------------------
-
-AMD Memory Encryption whitepaper:
-https://developer.amd.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf
-
-Secure Encrypted Virtualization Key Management:
-[1] http://developer.amd.com/wordpress/media/2017/11/55766_SEV-KM-API_Specification.pdf
-
-KVM Forum slides:
-http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf
-https://www.linux-kvm.org/images/9/94/Extending-Secure-Encrypted-Virtualization-with-SEV-ES-Thomas-Lendacky-AMD.pdf
-
-AMD64 Architecture Programmer's Manual:
- http://support.amd.com/TechDocs/24593.pdf
- SME is section 7.10
- SEV is section 15.34
- SEV-ES is section 15.35
diff --git a/docs/block-replication.txt b/docs/block-replication.txt
index 108e9166a8..e1b28a6cc1 100644
--- a/docs/block-replication.txt
+++ b/docs/block-replication.txt
@@ -79,7 +79,7 @@ Primary | || Secondary disk <--------- hidden-disk 5 <---------
|| | |
|| | |
|| '-------------------------'
- || drive-backup sync=none 6
+ || blockdev-backup sync=none 6
1) The disk on the primary is represented by a block device with two
children, providing replication between a primary disk and the host that
@@ -101,7 +101,7 @@ should support bdrv_make_empty() and backing file.
that is modified by the primary VM. It should also start as an empty disk,
and the driver supports bdrv_make_empty() and backing file.
-6) The drive-backup job (sync=none) is run to allow hidden-disk to buffer
+6) The blockdev-backup job (sync=none) is run to allow hidden-disk to buffer
any state that would otherwise be lost by the speculative write-through
of the NBD server into the secondary disk. So before block replication,
the primary disk and secondary disk should contain the same data.
@@ -156,15 +156,15 @@ Primary:
children.0.driver=raw
Run qmp command in primary qemu:
- { 'execute': 'human-monitor-command',
- 'arguments': {
- 'command-line': 'drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1'
+ { "execute": "human-monitor-command",
+ "arguments": {
+ "command-line": "drive_add -n buddy driver=replication,mode=primary,file.driver=nbd,file.host=xxxx,file.port=xxxx,file.export=colo1,node-name=nbd_client1"
}
}
- { 'execute': 'x-blockdev-change',
- 'arguments': {
- 'parent': 'colo1',
- 'node': 'nbd_client1'
+ { "execute": "x-blockdev-change",
+ "arguments": {
+ "parent": "colo1",
+ "node": "nbd_client1"
}
}
Note:
@@ -179,7 +179,7 @@ Primary:
Secondary:
-drive if=none,driver=raw,file.filename=1.raw,id=colo1 \
- -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=childs1
+ -drive if=none,id=childs1,driver=replication,mode=secondary,top-id=top-disk1
file.file.filename=active_disk.qcow2,\
file.driver=qcow2,\
file.backing.file.filename=hidden_disk.qcow2,\
@@ -189,21 +189,21 @@ Secondary:
vote-threshold=1,children.0=childs1
Then run qmp command in secondary qemu:
- { 'execute': 'nbd-server-start',
- 'arguments': {
- 'addr': {
- 'type': 'inet',
- 'data': {
- 'host': 'xxx',
- 'port': 'xxx'
+ { "execute": "nbd-server-start",
+ "arguments": {
+ "addr": {
+ "type": "inet",
+ "data": {
+ "host": "xxx",
+ "port": "xxx"
}
}
}
}
- { 'execute': 'nbd-server-add',
- 'arguments': {
- 'device': 'colo1',
- 'writable': true
+ { "execute": "nbd-server-add",
+ "arguments": {
+ "device": "colo1",
+ "writable": true
}
}
@@ -223,22 +223,22 @@ After Failover:
Primary:
The secondary host is down, so we should run the following qmp command
to remove the nbd child from the quorum:
- { 'execute': 'x-blockdev-change',
- 'arguments': {
- 'parent': 'colo1',
- 'child': 'children.1'
+ { "execute": "x-blockdev-change",
+ "arguments": {
+ "parent": "colo1",
+ "child": "children.1"
}
}
- { 'execute': 'human-monitor-command',
- 'arguments': {
- 'command-line': 'drive_del xxxx'
+ { "execute": "human-monitor-command",
+ "arguments": {
+ "command-line": "drive_del xxxx"
}
}
Note: there is no qmp command to remove the blockdev now
Secondary:
The primary host is down, so we should do the following thing:
- { 'execute': 'nbd-server-stop' }
+ { "execute": "nbd-server-stop" }
Promote Secondary to Primary:
see COLO-FT.txt
diff --git a/docs/ccid.txt b/docs/ccid.txt
deleted file mode 100644
index 2b85b1bd42..0000000000
--- a/docs/ccid.txt
+++ /dev/null
@@ -1,182 +0,0 @@
-QEMU CCID Device Documentation.
-
-Contents
-1. USB CCID device
-2. Building
-3. Using ccid-card-emulated with hardware
-4. Using ccid-card-emulated with certificates
-5. Using ccid-card-passthru with client side hardware
-6. Using ccid-card-passthru with client side certificates
-7. Passthrough protocol scenario
-8. libcacard
-
-1. USB CCID device
-
-The USB CCID device is a USB device implementing the CCID specification, which
-lets one connect smart card readers that implement the same spec. For more
-information see the specification:
-
- Universal Serial Bus
- Device Class: Smart Card
- CCID
- Specification for
- Integrated Circuit(s) Cards Interface Devices
- Revision 1.1
- April 22rd, 2005
-
-Smartcards are used for authentication, single sign on, decryption in
-public/private schemes and digital signatures. A smartcard reader on the client
-cannot be used on a guest with simple usb passthrough since it will then not be
-available on the client, possibly locking the computer when it is "removed". On
-the other hand this device can let you use the smartcard on both the client and
-the guest machine. It is also possible to have a completely virtual smart card
-reader and smart card (i.e. not backed by a physical device) using this device.
-
-2. Building
-
-The cryptographic functions and access to the physical card is done via the
-libcacard library, whose development package must be installed prior to
-building QEMU:
-
-In redhat/fedora:
- yum install libcacard-devel
-In ubuntu:
- apt-get install libcacard-dev
-
-Configuring and building:
- ./configure --enable-smartcard && make
-
-
-3. Using ccid-card-emulated with hardware
-
-Assuming you have a working smartcard on the host with the current
-user, using libcacard, QEMU acts as another client using ccid-card-emulated:
-
- qemu -usb -device usb-ccid -device ccid-card-emulated
-
-
-4. Using ccid-card-emulated with certificates stored in files
-
-You must create the CA and card certificates. This is a one time process.
-We use NSS certificates:
-
- mkdir fake-smartcard
- cd fake-smartcard
- certutil -N -d sql:$PWD
- certutil -S -d sql:$PWD -s "CN=Fake Smart Card CA" -x -t TC,TC,TC -n fake-smartcard-ca
- certutil -S -d sql:$PWD -t ,, -s "CN=John Doe" -n id-cert -c fake-smartcard-ca
- certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (signing)" --nsCertType smime -n signing-cert -c fake-smartcard-ca
- certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (encryption)" --nsCertType sslClient -n encryption-cert -c fake-smartcard-ca
-
-Note: you must have exactly three certificates.
-
-You can use the emulated card type with the certificates backend:
-
- qemu -usb -device usb-ccid -device ccid-card-emulated,backend=certificates,db=sql:$PWD,cert1=id-cert,cert2=signing-cert,cert3=encryption-cert
-
-To use the certificates in the guest, export the CA certificate:
-
- certutil -L -r -d sql:$PWD -o fake-smartcard-ca.cer -n fake-smartcard-ca
-
-and import it in the guest:
-
- certutil -A -d /etc/pki/nssdb -i fake-smartcard-ca.cer -t TC,TC,TC -n fake-smartcard-ca
-
-In a Linux guest you can then use the CoolKey PKCS #11 module to access
-the card:
-
- certutil -d /etc/pki/nssdb -L -h all
-
-It will prompt you for the PIN (which is the password you assigned to the
-certificate database early on), and then show you all three certificates
-together with the manually imported CA cert:
-
- Certificate Nickname Trust Attributes
- fake-smartcard-ca CT,C,C
- John Doe:CAC ID Certificate u,u,u
- John Doe:CAC Email Signature Certificate u,u,u
- John Doe:CAC Email Encryption Certificate u,u,u
-
-If this does not happen, CoolKey is not installed or not registered with
-NSS. Registration can be done from Firefox or the command line:
-
- modutil -dbdir /etc/pki/nssdb -add "CAC Module" -libfile /usr/lib64/pkcs11/libcoolkeypk11.so
- modutil -dbdir /etc/pki/nssdb -list
-
-
-5. Using ccid-card-passthru with client side hardware
-
-on the host specify the ccid-card-passthru device with a suitable chardev:
-
- qemu -chardev socket,server=on,host=0.0.0.0,port=2001,id=ccid,wait=off \
- -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid
-
-on the client run vscclient, built when you built QEMU:
-
- vscclient <qemu-host> 2001
-
-
-6. Using ccid-card-passthru with client side certificates
-
-This case is not particularly useful, but you can use it to debug
-your setup if #4 works but #5 does not.
-
-Follow instructions as per #4, except run QEMU and vscclient as follows:
-Run qemu as per #5, and run vscclient from the "fake-smartcard"
-directory as follows:
-
- qemu -chardev socket,server=on,host=0.0.0.0,port=2001,id=ccid,wait=off \
- -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid
- vscclient -e "db=\"sql:$PWD\" use_hw=no soft=(,Test,CAC,,id-cert,signing-cert,encryption-cert)" <qemu-host> 2001
-
-
-7. Passthrough protocol scenario
-
-This is a typical interchange of messages when using the passthru card device.
-usb-ccid is a usb device. It defaults to an unattached usb device on startup.
-usb-ccid expects a chardev and expects the protocol defined in
-cac_card/vscard_common.h to be passed over that.
-The usb-ccid device can be in one of three modes:
- * detached
- * attached with no card
- * attached with card
-
-A typical interchange is: (the arrow shows who started each exchange, it can be client
-originated or guest originated)
-
-client event | vscclient | passthru | usb-ccid | guest event
-----------------------------------------------------------------------------------------------
- | VSC_Init | | |
- | VSC_ReaderAdd | | attach |
- | | | | sees new usb device.
-card inserted -> | | | |
- | VSC_ATR | insert | insert | see new card
- | | | |
- | VSC_APDU | VSC_APDU | | <- guest sends APDU
-client<->physical | | | |
-card APDU exchange| | | |
-client response ->| VSC_APDU | VSC_APDU | | receive APDU response
- ...
- [APDU<->APDU repeats several times]
- ...
-card removed -> | | | |
- | VSC_CardRemove | remove | remove | card removed
- ...
- [(card insert, apdu's, card remove) repeat]
- ...
-kill/quit | | | |
- vscclient | | | |
- | VSC_ReaderRemove | | detach |
- | | | | usb device removed.
-
-
-8. libcacard
-
-Both ccid-card-emulated and vscclient use libcacard as the card emulator.
-libcacard implements a completely virtual CAC (DoD standard for smart
-cards) compliant card and uses NSS to retrieve certificates and do
-any encryption. The backend can then be a real reader and card, or
-certificates stored in files.
-
-For documentation of the library see docs/libcacard.txt.
-
diff --git a/docs/colo-proxy.txt b/docs/colo-proxy.txt
index 1fc38aed1b..e712c883db 100644
--- a/docs/colo-proxy.txt
+++ b/docs/colo-proxy.txt
@@ -162,7 +162,7 @@ Here is an example using demonstration IP and port addresses to more
clearly describe the usage.
Primary(ip:3.3.3.3):
--netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,downscript=/etc/qemu-ifdown
+-netdev tap,id=hn0,vhost=off
-device e1000,id=e0,netdev=hn0,mac=52:a4:00:12:78:66
-chardev socket,id=mirror0,host=3.3.3.3,port=9003,server=on,wait=off
-chardev socket,id=compare1,host=3.3.3.3,port=9004,server=on,wait=off
@@ -177,7 +177,7 @@ Primary(ip:3.3.3.3):
-object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,outdev=compare_out0,iothread=iothread1
Secondary(ip:3.3.3.8):
--netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,down script=/etc/qemu-ifdown
+-netdev tap,id=hn0,vhost=off
-device e1000,netdev=hn0,mac=52:a4:00:12:78:66
-chardev socket,id=red0,host=3.3.3.3,port=9003
-chardev socket,id=red1,host=3.3.3.3,port=9004
@@ -202,7 +202,7 @@ Primary(ip:3.3.3.3):
-object colo-compare,id=comp0,primary_in=compare0-0,secondary_in=compare1,outdev=compare_out0,vnet_hdr_support
Secondary(ip:3.3.3.8):
--netdev tap,id=hn0,vhost=off,script=/etc/qemu-ifup,down script=/etc/qemu-ifdown
+-netdev tap,id=hn0,vhost=off
-device e1000,netdev=hn0,mac=52:a4:00:12:78:66
-chardev socket,id=red0,host=3.3.3.3,port=9003
-chardev socket,id=red1,host=3.3.3.3,port=9004
diff --git a/docs/conf.py b/docs/conf.py
index ff6e92c6e2..aae0304ac6 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -29,18 +29,8 @@
import os
import sys
import sphinx
-from distutils.version import LooseVersion
from sphinx.errors import ConfigError
-# Make Sphinx fail cleanly if using an old Python, rather than obscurely
-# failing because some code in one of our extensions doesn't work there.
-# In newer versions of Sphinx this will display nicely; in older versions
-# Sphinx will also produce a Python backtrace but at least the information
-# gets printed...
-if sys.version_info < (3,6):
- raise ConfigError(
- "QEMU requires a Sphinx that uses Python 3.6 or better\n")
-
# The per-manual conf.py will set qemu_docdir for a single-manual build;
# otherwise set it here if this is an entire-manual-set build.
# This is always the absolute path of the docs/ directory in the source tree.
@@ -73,8 +63,14 @@ needs_sphinx = '1.6'
# ones.
extensions = ['kerneldoc', 'qmp_lexer', 'hxtool', 'depfile', 'qapidoc']
+if sphinx.version_info[:3] > (4, 0, 0):
+ tags.add('sphinx4')
+ extensions += ['dbusdoc']
+else:
+ extensions += ['fakedbusdoc']
+
# Add any paths that contain templates here, relative to this directory.
-templates_path = ['_templates']
+templates_path = [os.path.join(qemu_docdir, '_templates')]
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
@@ -85,9 +81,14 @@ source_suffix = '.rst'
# The master toctree document.
master_doc = 'index'
+# Interpret `single-backticks` to be a cross-reference to any kind of
+# referenceable object. Unresolvable or ambiguous references will emit a
+# warning at build time.
+default_role = 'any'
+
# General information about the project.
project = u'QEMU'
-copyright = u'2021, The QEMU Project Developers'
+copyright = u'2024, The QEMU Project Developers'
author = u'The QEMU Project Developers'
# The version info for the project you're documenting, acts as replacement for
@@ -115,7 +116,7 @@ finally:
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
-language = None
+language = 'en'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
@@ -163,10 +164,10 @@ html_theme = 'sphinx_rtd_theme'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
-if LooseVersion(sphinx_rtd_theme.__version__) >= LooseVersion("0.4.3"):
- html_theme_options = {
- "style_nav_header_background": "#802400",
- }
+html_theme_options = {
+ "style_nav_header_background": "#802400",
+ "navigation_with_keys": True,
+}
html_logo = os.path.join(qemu_docdir, "../ui/icons/qemu_128x128.png")
@@ -181,6 +182,10 @@ html_css_files = [
'theme_overrides.css',
]
+html_js_files = [
+ 'custom.js',
+]
+
html_context = {
"display_gitlab": True,
"gitlab_user": "qemu-project",
@@ -274,26 +279,9 @@ man_pages = [
('tools/virtfs-proxy-helper', 'virtfs-proxy-helper',
'QEMU 9p virtfs proxy filesystem helper',
['M. Mohan Kumar'], 1),
- ('tools/virtiofsd', 'virtiofsd',
- 'QEMU virtio-fs shared file system daemon',
- ['Stefan Hajnoczi <stefanha@redhat.com>',
- 'Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>'], 1),
]
man_make_section_directory = False
-# -- Options for Texinfo output -------------------------------------------
-
-# Grouping the document tree into Texinfo files. List of tuples
-# (source start file, target name, title, author,
-# dir menu entry, description, category)
-texinfo_documents = [
- (master_doc, 'QEMU', u'QEMU Documentation',
- author, 'QEMU', 'One line description of project.',
- 'Miscellaneous'),
-]
-
-
-
# We use paths starting from qemu_docdir here so that you can run
# sphinx-build from anywhere and the kerneldoc extension can still
# find everything.
@@ -301,3 +289,5 @@ kerneldoc_bin = ['perl', os.path.join(qemu_docdir, '../scripts/kernel-doc')]
kerneldoc_srctree = os.path.join(qemu_docdir, '..')
hxtool_srctree = os.path.join(qemu_docdir, '..')
qapidoc_srctree = os.path.join(qemu_docdir, '..')
+dbusdoc_srctree = os.path.join(qemu_docdir, '..')
+dbus_index_common_prefix = ["org.qemu."]
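
The version gate added to ``docs/conf.py`` above can be read as a small pure function. The following is an illustrative sketch only and is not part of the patch: the helper name ``select_extensions`` is hypothetical, and only the extension names and the ``(4, 0, 0)`` threshold are taken from the hunk.

```python
def select_extensions(version_info, base):
    """Return the extension list for a Sphinx version tuple.

    Hypothetical helper mirroring the conf.py hunk: the real code
    reads sphinx.version_info and mutates the module-level list.
    """
    extensions = list(base)
    # Strictly greater than 4.0.0, as in the hunk: 4.0.0 itself
    # takes the fallback branch.
    if tuple(version_info[:3]) > (4, 0, 0):
        extensions.append('dbusdoc')
    else:
        extensions.append('fakedbusdoc')
    return extensions


base = ['kerneldoc', 'qmp_lexer', 'hxtool', 'depfile', 'qapidoc']
new_sphinx = select_extensions((5, 0, 2, 'final', 0), base)
old_sphinx = select_extensions((3, 5, 4, 'final', 0), base)
exact_4_0_0 = select_extensions((4, 0, 0, 'final', 0), base)
```

Because the comparison is strict, a Sphinx release reporting exactly 4.0.0 would get ``fakedbusdoc`` under this reading.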
diff --git a/docs/config/mach-virt-graphical.cfg b/docs/config/mach-virt-graphical.cfg
index d6d31b17f5..eba76eb198 100644
--- a/docs/config/mach-virt-graphical.cfg
+++ b/docs/config/mach-virt-graphical.cfg
@@ -56,9 +56,11 @@
[machine]
type = "virt"
- accel = "kvm"
gic-version = "host"
+[accel]
+ accel = "kvm"
+
[memory]
size = "1024"
diff --git a/docs/config/mach-virt-serial.cfg b/docs/config/mach-virt-serial.cfg
index 18a7c83731..324b0542ff 100644
--- a/docs/config/mach-virt-serial.cfg
+++ b/docs/config/mach-virt-serial.cfg
@@ -62,9 +62,11 @@
[machine]
type = "virt"
- accel = "kvm"
gic-version = "host"
+[accel]
+ accel = "kvm"
+
[memory]
size = "1024"
diff --git a/docs/config/q35-emulated.cfg b/docs/config/q35-emulated.cfg
index 99ac918e78..b4bd7e858a 100644
--- a/docs/config/q35-emulated.cfg
+++ b/docs/config/q35-emulated.cfg
@@ -61,6 +61,8 @@
[machine]
type = "q35"
+
+[accel]
accel = "kvm"
[memory]
@@ -286,3 +288,7 @@
driver = "hda-duplex"
bus = "ich9-hda-audio.0"
cad = "0"
+ audiodev = "audiodev0"
+
+[audiodev "audiodev0"]
+ driver = "none" # CHANGE ME
diff --git a/docs/config/q35-virtio-graphical.cfg b/docs/config/q35-virtio-graphical.cfg
index 4207f11e4f..820860aefe 100644
--- a/docs/config/q35-virtio-graphical.cfg
+++ b/docs/config/q35-virtio-graphical.cfg
@@ -55,6 +55,8 @@
[machine]
type = "q35"
+
+[accel]
accel = "kvm"
[memory]
@@ -246,3 +248,7 @@
driver = "hda-duplex"
bus = "sound.0"
cad = "0"
+ audiodev = "audiodev0"
+
+[audiodev "audiodev0"]
+ driver = "none" # CHANGE ME
diff --git a/docs/config/q35-virtio-serial.cfg b/docs/config/q35-virtio-serial.cfg
index d2830aec5e..023291390e 100644
--- a/docs/config/q35-virtio-serial.cfg
+++ b/docs/config/q35-virtio-serial.cfg
@@ -60,6 +60,8 @@
[machine]
type = "q35"
+
+[accel]
accel = "kvm"
[memory]
diff --git a/docs/devel/acpi-bits.rst b/docs/devel/acpi-bits.rst
new file mode 100644
index 0000000000..1ec394f5fb
--- /dev/null
+++ b/docs/devel/acpi-bits.rst
@@ -0,0 +1,167 @@
+=============================================================================
+ACPI/SMBIOS avocado tests using biosbits
+=============================================================================
+************
+Introduction
+************
+Biosbits is software written by Josh Triplett that can be downloaded
+from https://biosbits.org/. The GitHub codebase can be found
+`here <https://github.com/biosbits/bits/tree/master>`__. It is software that
+executes the bios components such as acpi and smbios tables directly through
+the acpica bios interpreter (a freely available C based library written by Intel,
+downloadable from https://acpica.org/ and included with biosbits) without an
+operating system getting involved in between. Bios-bits has python integration
+with grub so actual routines that execute bios components can be written in
+python instead of bash-ish (grub's native scripting language).
+There are several advantages to directly testing the bios in a real physical
+machine or in a VM as opposed to indirectly discovering bios issues through the
+operating system (the OS). Operating systems tend to bypass bios problems and
+hide them from the end user. We have more control over what we want to test and
+how by being as close to the bios on a running system as possible without a
+complicated software component such as an operating system coming in between.
+Another issue is that we cannot exercise bios components such as ACPI and
+SMBIOS without being in the highest hardware privilege level, ring 0 for
+example in case of x86. Since the OS executes from ring 0 whereas normal user
+land software resides in unprivileged ring 3, the operating system must be modified
+in order to write our test routines that exercise and test the bios. This is
+not possible in all cases. Lastly, test frameworks and routines are preferably
+written using a high level scripting language such as python. OSes and
+OS modules are generally written using low level languages such as C and
+low level assembly machine language. Writing test routines in a low level
+language makes things more cumbersome. These and other reasons make using
+bios-bits very attractive for testing bioses. More details on the inspiration
+for developing biosbits and its real life uses can be found in [#a]_ and [#b]_.
+
+For QEMU, we maintain a fork of bios bits in gitlab along with all the
+dependent submodules `here <https://gitlab.com/qemu-project/biosbits-bits>`__.
+This fork contains numerous fixes, a newer acpica and changes specific to
+running these avocado QEMU tests using bits. The author of this document
+is the sole maintainer of the QEMU fork of the bios bits repository. For more
+information, please see the author's `FOSDEM talk on this bios-bits based test
+framework <https://fosdem.org/2024/schedule/event/fosdem-2024-2262-exercising-qemu-generated-acpi-smbios-tables-using-biosbits-from-within-a-guest-vm-/>`__.
+
+*********************************
+Description of the test framework
+*********************************
+
+Under the directory ``tests/avocado/``, ``acpi-bits.py`` is a QEMU avocado
+test that drives all this.
+
+A brief description of the various test files follows.
+
+Under ``tests/avocado/`` as the root we have:
+
+::
+
+ ├── acpi-bits
+ │ ├── bits-config
+ │ │ └── bits-cfg.txt
+ │ ├── bits-tests
+ │ ├── smbios.py2
+ │ ├── testacpi.py2
+ │ └── testcpuid.py2
+ ├── acpi-bits.py
+
+* ``tests/avocado``:
+
+ ``acpi-bits.py``:
+ This is the main python avocado test script that generates a
+ biosbits iso. It then spawns a QEMU VM with it, collects the log and reports
+ test failures. This is the script one would be interested in if they wanted
+ to add or change some component of the log parsing, add a new command line
+ to alter how QEMU is spawned etc. Test writers typically would not need to
+ modify this script unless they wanted to enhance or change the log parsing
+ for their tests. In order to enable debugging, you can set the **V=1**
+ environment variable. This enables verbose mode for the test and also dumps
+ the entire log from bios bits and more information in case a failure happens.
+ You can also set **BITS_DEBUG=1** to turn on debug mode. It will enable
+ verbose logs and also retain the temporary work directory the test used for
+ you to inspect and run the specific commands manually.
+
+ In order to run this test, please perform the following steps from the QEMU
+ build directory:
+ ::
+
+ $ make check-venv (needed only the first time to create the venv)
+ $ ./pyvenv/bin/avocado run -t acpi tests/avocado
+
+ The above will run all acpi avocado tests including this one.
+ In order to run the individual tests, perform the following:
+ ::
+
+ $ ./pyvenv/bin/avocado run tests/avocado/acpi-bits.py --tap -
+
+ The above will produce output in tap format. You can omit "--tap -" at the
+ end and it will produce output like the following:
+ ::
+
+ $ ./pyvenv/bin/avocado run tests/avocado/acpi-bits.py
+ Fetching asset from tests/avocado/acpi-bits.py:AcpiBitsTest.test_acpi_smbios_bits
+ JOB ID : eab225724da7b64c012c65705dc2fa14ab1defef
+ JOB LOG : /home/anisinha/avocado/job-results/job-2022-10-10T17.58-eab2257/job.log
+ (1/1) tests/avocado/acpi-bits.py:AcpiBitsTest.test_acpi_smbios_bits: PASS (33.09 s)
+ RESULTS : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | WARN 0 | INTERRUPT 0 | CANCEL 0
+ JOB TIME : 39.22 s
+
+ You can inspect the log file for more information about the run or in order
+ to diagnose issues. If you pass V=1 in the environment, more diagnostic logs
+ will be found in the test log.
+
+* ``tests/avocado/acpi-bits/bits-config``:
+
+ This location contains biosbits configuration files that determine how the
+ software runs the tests.
+
+ ``bits-cfg.txt``:
+ This is the biosbits config file that determines what tests
+ or actions are performed by bits. Descriptions of the config options are
+ provided in the file itself.
+
+* ``tests/avocado/acpi-bits/bits-tests``:
+
+ This directory contains biosbits python based tests that are run from within
+ the biosbits environment in the spawned VM. New additions of test cases can
+ be made in the appropriate test file. For example, new acpi tests can go
+ into testacpi.py2 and one would call testsuite.add_test() to register the new
+ test so that it gets executed as a part of the ACPI tests.
+ It might be occasionally necessary to disable some subtests or add a new
+ test that belongs to a test suite not already present in this directory. To
+ do this, please clone the bits source from
+ https://gitlab.com/qemu-project/biosbits-bits/-/tree/qemu-bits.
+ Note that this is the "qemu-bits" branch and not the "bits" branch of the
+ repository. "qemu-bits" is the branch where we have made all the QEMU
+ specific enhancements and we must use the source from this branch only.
+ Copy the test suite/script that needs modification (addition of new tests
+ or disabling them) from python directory into this directory. For
+ example, in order to change cpuid related tests, copy the following
+ file into this directory and rename it with .py2 extension:
+ https://gitlab.com/qemu-project/biosbits-bits/-/blob/qemu-bits/python/testcpuid.py
+ Then make your additions and changes here. Therefore, the steps are:
+
+ (a) Copy unmodified test script to this directory from bits source.
+ (b) Add a SPDX license header.
+ (c) Perform modifications to the test.
+
+ Steps (a), (b) and (c) should preferably go into separate commits so that
+ the original test script and the changes we have made are separated and
+ clear. (a) and (b) can sometimes be combined into a single step.
+
+ The test framework will then use your modified test script to run the test.
+ No further changes would be needed. Please check the logs to make sure that
+ appropriate changes have taken effect.
+
+ The tests have an extension .py2 in order to indicate that:
+
+ (a) They are python2.7 based scripts and not python 3 scripts.
+ (b) They are run from within the bios bits VM and are not subject to QEMU
+ build/test python script maintenance and dependency resolutions.
+ (c) They need not be loaded by avocado framework when running tests.
+
+
+Author: Ani Sinha <anisinha@redhat.com>
+
+References:
+-----------
+.. [#a] https://blog.linuxplumbersconf.org/2011/ocw/system/presentations/867/original/bits.pdf
+.. [#b] https://www.youtube.com/watch?v=36QIepyUuhg
+.. [#c] https://fosdem.org/2024/schedule/event/fosdem-2024-2262-exercising-qemu-generated-acpi-smbios-tables-using-biosbits-from-within-a-guest-vm-/
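
As an illustration of the non-tap output shown in the new ``acpi-bits.rst`` above, the ``RESULTS`` summary line can be split into per-status counts. This is a minimal sketch and not part of the patch or of avocado: the function name ``parse_results`` is hypothetical, and the line format is assumed to match the sample output quoted in the document.

```python
def parse_results(line):
    """Parse an avocado-style 'RESULTS : PASS 1 | ERROR 0 | ...' line.

    Hypothetical helper; assumes each field is '<NAME> <COUNT>' and
    fields are separated by '|', as in the sample output.
    """
    # Everything after the first colon is the summary proper.
    _, _, summary = line.partition(':')
    counts = {}
    for field in summary.split('|'):
        name, value = field.split()
        counts[name] = int(value)
    return counts


sample = ("RESULTS    : PASS 1 | ERROR 0 | FAIL 0 | SKIP 0 | "
          "WARN 0 | INTERRUPT 0 | CANCEL 0")
counts = parse_results(sample)
# A run is clean here when nothing failed or errored.
clean = counts['FAIL'] == 0 and counts['ERROR'] == 0
```

A wrapper script could use such counts to decide whether to rerun with V=1 for more diagnostic output.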
diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
index 52baa0736d..b77c6e13e1 100644
--- a/docs/devel/atomics.rst
+++ b/docs/devel/atomics.rst
@@ -1,3 +1,5 @@
+.. _atomics-ref:
+
=========================
Atomic operations in QEMU
=========================
@@ -25,7 +27,8 @@ provides macros that fall in three camps:
- weak atomic access and manual memory barriers: ``qatomic_read()``,
``qatomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
- ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``;
+ ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``,
+ ``smp_mb__before_rmw()``, ``smp_mb__after_rmw()``;
- sequentially consistent atomic access: everything else.
@@ -99,28 +102,10 @@ Similar operations return the new value of ``*ptr``::
typeof(*ptr) qatomic_or_fetch(ptr, val)
typeof(*ptr) qatomic_xor_fetch(ptr, val)
-``qemu/atomic.h`` also provides loads and stores that cannot be reordered
-with each other::
-
- typeof(*ptr) qatomic_mb_read(ptr)
- void qatomic_mb_set(ptr, val)
-
-However these do not provide sequential consistency and, in particular,
-they do not participate in the total ordering enforced by
-sequentially-consistent operations. For this reason they are deprecated.
-They should instead be replaced with any of the following (ordered from
-easiest to hardest):
-
-- accesses inside a mutex or spinlock
+``qemu/atomic.h`` also provides an optimized shortcut for
+``qatomic_set`` followed by ``smp_mb``::
-- lightweight synchronization primitives such as ``QemuEvent``
-
-- RCU operations (``qatomic_rcu_read``, ``qatomic_rcu_set``) when publishing
- or accessing a new version of a data structure
-
-- other atomic accesses: ``qatomic_read`` and ``qatomic_load_acquire`` for
- loads, ``qatomic_set`` and ``qatomic_store_release`` for stores, ``smp_mb``
- to forbid reordering subsequent loads before a store.
+ void qatomic_set_mb(ptr, val)
Weak atomic access and manual memory barriers
@@ -134,7 +119,7 @@ The only guarantees that you can rely upon in this case are:
ordinary accesses instead cause data races if they are concurrent with
other accesses of which at least one is a write. In order to ensure this,
the compiler will not optimize accesses out of existence, create unsolicited
- accesses, or perform other similar optimzations.
+ accesses, or perform other similar optimizations.
- acquire operations will appear to happen, with respect to the other
components of the system, before all the LOAD or STORE operations
@@ -217,10 +202,9 @@ They come in six kinds:
retrieves the address to which the second load will be directed),
the processor will guarantee that the first LOAD will appear to happen
before the second with respect to the other components of the system.
- However, this is not always true---for example, it was not true on
- Alpha processors. Whenever this kind of access happens to shared
- memory (that is not protected by a lock), a read barrier is needed,
- and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.
+ Therefore, unlike ``smp_rmb()`` or ``qatomic_load_acquire()``,
+ ``smp_read_barrier_depends()`` can be just a compiler barrier on
+ weakly-ordered architectures such as Arm or PPC\ [#]_.
Note that the first load really has to have a _data_ dependency and not
a control dependency. If the address for the second load is dependent
@@ -228,6 +212,10 @@ They come in six kinds:
than actually loading the address itself, then it's a _control_
dependency and a full read barrier or better is required.
+.. [#] The DEC Alpha is an exception, because ``smp_read_barrier_depends()``
+ needs a processor barrier. On strongly-ordered architectures such
+ as x86 or s390, ``smp_rmb()`` and ``qatomic_load_acquire()`` can
+ also be compiler barriers only.
Memory barriers and ``qatomic_load_acquire``/``qatomic_store_release`` are
mostly used when a data structure has one thread that is always a writer
@@ -466,13 +454,19 @@ and memory barriers, and the equivalents in QEMU:
In QEMU, the second kind is named ``atomic_OP_fetch``.
- different atomic read-modify-write operations in Linux imply
- a different set of memory barriers; in QEMU, all of them enforce
- sequential consistency.
+ a different set of memory barriers. In QEMU, all of them enforce
+ sequential consistency: there is a single order in which the
+ program sees them happen.
-- in QEMU, ``qatomic_read()`` and ``qatomic_set()`` do not participate in
- the total ordering enforced by sequentially-consistent operations.
- This is because QEMU uses the C11 memory model. The following example
- is correct in Linux but not in QEMU:
+- however, according to the C11 memory model that QEMU uses, this order
+ does not propagate to other memory accesses on either side of the
+ read-modify-write operation. As far as those are concerned, the
+ operation consists of just a load-acquire followed by a store-release.
+ Stores that precede the RMW operation, and loads that follow it, can
+ still be reordered and will happen *in the middle* of the read-modify-write
+ operation!
+
+ Therefore, the following example is correct in Linux but not in QEMU:
+----------------------------------+--------------------------------+
| Linux (correct) | QEMU (incorrect) |
@@ -486,9 +480,24 @@ and memory barriers, and the equivalents in QEMU:
because the read of ``y`` can be moved (by either the processor or the
compiler) before the write of ``x``.
- Fixing this requires an ``smp_mb()`` memory barrier between the write
- of ``x`` and the read of ``y``. In the common case where only one thread
- writes ``x``, it is also possible to write it like this:
+ Fixing this requires a full memory barrier between the write of ``x`` and
+ the read of ``y``. QEMU provides ``smp_mb__before_rmw()`` and
+ ``smp_mb__after_rmw()``; they act both as an optimization,
+ avoiding the memory barrier on processors where it is unnecessary,
+ and as a clarification of this corner case of the C11 memory model:
+
+ +--------------------------------+
+ | QEMU (correct) |
+ +================================+
+ | :: |
+ | |
+ | a = qatomic_fetch_add(&x, 2);|
+ | smp_mb__after_rmw(); |
+ | b = qatomic_read(&y); |
+ +--------------------------------+
+
+ In the common case where only one thread writes ``x``, it is also possible
+ to write it like this:
+--------------------------------+
| QEMU (correct) |
@@ -496,8 +505,7 @@ and memory barriers, and the equivalents in QEMU:
| :: |
| |
| a = qatomic_read(&x); |
- | qatomic_set(&x, a + 2); |
- | smp_mb(); |
+ | qatomic_set_mb(&x, a + 2); |
| b = qatomic_read(&y); |
+--------------------------------+
diff --git a/docs/devel/block-coroutine-wrapper.rst b/docs/devel/block-coroutine-wrapper.rst
index 412851986b..6dd2cdcab3 100644
--- a/docs/devel/block-coroutine-wrapper.rst
+++ b/docs/devel/block-coroutine-wrapper.rst
@@ -26,12 +26,12 @@ called ``bdrv_foo(<same args>)``. In this case the script can help. To
trigger the generation:
1. You need ``bdrv_foo`` declaration somewhere (for example, in
- ``block/coroutines.h``) with the ``generated_co_wrapper`` mark,
+ ``block/coroutines.h``) with the ``co_wrapper`` mark,
like this:
.. code-block:: c
- int generated_co_wrapper bdrv_foo(<some args>);
+ int co_wrapper bdrv_foo(<some args>);
2. You need to feed this declaration to block-coroutine-wrapper script.
For this, add the .h (or .c) file with the declaration to the
@@ -46,7 +46,7 @@ Links
1. The script location is ``scripts/block-coroutine-wrapper.py``.
-2. Generic place for private ``generated_co_wrapper`` declarations is
+2. Generic place for private ``co_wrapper`` declarations is
``block/coroutines.h``, for public declarations:
``include/block/block.h``
diff --git a/docs/devel/build-system.rst b/docs/devel/build-system.rst
index 3baec158f2..09caf2f8e1 100644
--- a/docs/devel/build-system.rst
+++ b/docs/devel/build-system.rst
@@ -4,30 +4,14 @@ The QEMU build system architecture
This document aims to help developers understand the architecture of the
QEMU build system. As with projects using GNU autotools, the QEMU build
-system has two stages, first the developer runs the "configure" script
+system has two stages; first the developer runs the "configure" script
to determine the local build environment characteristics, then they run
-"make" to build the project. There is about where the similarities with
+"make" to build the project. This is about where the similarities with
GNU autotools end, so try to forget what you know about them.
+The two general ways to perform a build are as follows:
-Stage 1: configure
-==================
-
-The QEMU configure script is written directly in shell, and should be
-compatible with any POSIX shell, hence it uses #!/bin/sh. An important
-implication of this is that it is important to avoid using bash-isms on
-development platforms where bash is the primary host.
-
-In contrast to autoconf scripts, QEMU's configure is expected to be
-silent while it is checking for features. It will only display output
-when an error occurs, or to show the final feature enablement summary
-on completion.
-
-Because QEMU uses the Meson build system under the hood, only VPATH
-builds are supported. There are two general ways to invoke configure &
-perform a build:
-
- - VPATH, build artifacts outside of QEMU source tree entirely::
+ - build artifacts outside of QEMU source tree entirely::
cd ../
mkdir build
@@ -35,155 +19,201 @@ perform a build:
../qemu/configure
make
- - VPATH, build artifacts in a subdir of QEMU source tree::
+ - build artifacts in a subdir of QEMU source tree::
mkdir build
cd build
../configure
make
-For now, checks on the compilation environment are found in configure
-rather than meson.build, though this is expected to change. The command
-line is parsed in the configure script and, whenever needed, converted
-into the appropriate options to Meson.
-
-New checks should be added to Meson, which usually comprises the
-following tasks:
-
- - Add a Meson build option to meson_options.txt.
-
- - Add support to the command line arg parser to handle any new
- ``--enable-XXX``/``--disable-XXX`` flags required by the feature.
-
- - Add information to the help output message to report on the new
- feature flag.
-
- - Add code to perform the actual feature check.
-
- - Add code to include the feature status in ``config-host.h``
-
- - Add code to print out the feature status in the configure summary
- upon completion.
+Most of the actual build process uses Meson under the hood, therefore
+build artifacts cannot be placed in the source tree itself.
-Taking the probe for SDL2_Image as an example, we have the following pieces
-in configure::
-
- # Initial variable state
- sdl_image=auto
+Stage 1: configure
+==================
- ..snip..
+The configure script has five tasks:
- # Configure flag processing
- --disable-sdl-image) sdl_image=disabled
- ;;
- --enable-sdl-image) sdl_image=enabled
- ;;
+ - detect the host architecture
- ..snip..
+ - list the targets for which to build emulators; the list of
+ targets also affects which firmware binaries and tests to build
- # Help output feature message
- sdl-image SDL Image support for icons
+ - find the compilers (native and cross) used to build executables,
+ firmware and tests. The results are written as either Makefile
+ fragments (``config-host.mak``) or a Meson machine file
+ (``config-meson.cross``)
- ..snip..
+ - create a virtual environment in which all Python code runs during
+ the build, and possibly install packages into it from PyPI
- # Meson invocation
- -Dsdl_image=$sdl_image
+ - invoke Meson in the virtual environment, to perform the actual
+ configuration step for the emulator build
-In meson_options.txt::
+The configure script automatically recognizes command line options for
+which a same-named Meson option exists; dashes in the command line are
+replaced with underscores.
- option('sdl', type : 'feature', value : 'auto',
- description: 'SDL Image support for icons')
+Almost all QEMU developers that need to modify the build system will
+only be concerned with Meson, and therefore can skip the rest of this
+section.
-In meson.build::
- # Detect dependency
- sdl_image = dependency('SDL2_image', required: get_option('sdl_image'),
- method: 'pkg-config',
- kwargs: static_kwargs)
+Modifying ``configure``
+-----------------------
- # Create config-host.h (if applicable)
- config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
+``configure`` is a shell script; it uses ``#!/bin/sh`` and therefore
+should be compatible with any POSIX shell. It is important to avoid
+bash-isms, which are easy to introduce on development platforms where
+bash is the primary shell.
- # Summary
- summary_info += {'SDL image support': sdl_image.found()}
+The configure script provides a variety of functions to help write
+portable shell code and provide consistent behavior across architectures
+and operating systems:
+``error_exit $MESSAGE $MORE...``
+ Print $MESSAGE to stderr, followed by $MORE... and then exit from the
+ configure script with non-zero status.
+``has $COMMAND``
+ Determine if $COMMAND exists in the current environment, either as a
+ shell builtin, or executable binary, returning 0 on success. The
+ replacement in Meson is ``find_program()``.
-Helper functions
-----------------
+``probe_target_compiler $TARGET``
+ Detect a cross compiler and cross tools for the QEMU target $TARGET (e.g.,
+ ``$CPU-softmmu``, ``$CPU-linux-user``, ``$CPU-bsd-user``). If a working
+ compiler is present, return success and set variables ``$target_cc``,
+ ``$target_ar``, etc. to non-empty values.
-The configure script provides a variety of helper functions to assist
-developers in checking for system features:
+``write_target_makefile``
+ Write a Makefile fragment to stdout, exposing the result of the most
+ recent ``probe_target_compiler`` call as the usual Make variables (``CC``,
+ ``AR``, ``LD``, etc.).
-``do_cc $ARGS...``
- Attempt to run the system C compiler passing it $ARGS...
-``do_cxx $ARGS...``
- Attempt to run the system C++ compiler passing it $ARGS...
+Configure does not generally perform tests for compiler options beyond
+basic checks to detect the host platform and ensure the compiler is
+functioning. These are performed using a few more helper functions:
``compile_object $CFLAGS``
Attempt to compile a test program with the system C compiler using
$CFLAGS. The test program must have been previously written to a file
- called $TMPC. The replacement in Meson is the compiler object ``cc``,
- which has methods such as ``cc.compiles()``,
- ``cc.check_header()``, ``cc.has_function()``.
+ called $TMPC.
``compile_prog $CFLAGS $LDFLAGS``
Attempt to compile a test program with the system C compiler using
$CFLAGS and link it with the system linker using $LDFLAGS. The test
program must have been previously written to a file called $TMPC.
- The replacement in Meson is ``cc.find_library()`` and ``cc.links()``.
-
-``has $COMMAND``
- Determine if $COMMAND exists in the current environment, either as a
- shell builtin, or executable binary, returning 0 on success. The
- replacement in Meson is ``find_program()``.
``check_define $NAME``
- Determine if the macro $NAME is defined by the system C compiler
+ Determine if the macro $NAME is defined by the system C compiler.
-``check_include $NAME``
- Determine if the include $NAME file is available to the system C
- compiler. The replacement in Meson is ``cc.has_header()``.
+``do_compiler $CC $ARGS...``
+ Attempt to run the C compiler $CC, passing it $ARGS... This function
+ does not use flags passed via options such as ``--extra-cflags``, and
+ therefore can be used to check for cross compilers. However, most
+ such checks are done at ``make`` time instead (see for example the
+ ``cc-option`` macro in ``pc-bios/option-rom/Makefile``).
``write_c_skeleton``
Write a minimal C program main() function to the temporary file
- indicated by $TMPC
-
-``feature_not_found $NAME $REMEDY``
- Print a message to stderr that the feature $NAME was not available
- on the system, suggesting the user try $REMEDY to address the
- problem.
-
-``error_exit $MESSAGE $MORE...``
- Print $MESSAGE to stderr, followed by $MORE... and then exit from the
- configure script with non-zero status
-
-``query_pkg_config $ARGS...``
- Run pkg-config passing it $ARGS. If QEMU is doing a static build,
- then --static will be automatically added to $ARGS
+ indicated by $TMPC.
+
+
+Python virtual environments and the build process
+-------------------------------------------------
+
+An important step in ``configure`` is to create a Python virtual
+environment (venv) during the configuration phase. The Python interpreter
+comes from the ``--python`` command line option, the ``$PYTHON`` variable
+from the environment, or the system PATH, in this order. The venv resides
+in the ``pyvenv`` directory in the build tree, and provides consistency
+in how the build process runs Python code.
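The lookup order described above can be sketched in a few lines of shell. This is a minimal illustration, not code taken from ``configure``: the function name ``pick_python`` is hypothetical, and the sketch assumes the interpreter on the PATH is called ``python3``.

```shell
# Hypothetical helper, not part of configure: choose a Python interpreter
# using the precedence described in the text.
#   $1: value of the --python option (may be empty)
pick_python() {
    if [ -n "$1" ]; then
        # 1) An explicit --python option wins.
        printf '%s\n' "$1"
    elif [ -n "${PYTHON:-}" ]; then
        # 2) Otherwise fall back to $PYTHON from the environment.
        printf '%s\n' "$PYTHON"
    else
        # 3) Finally, search the PATH (assumed name: python3).
        command -v python3
    fi
}
```

For example, ``pick_python ""`` with ``PYTHON`` unset prints whatever ``python3`` resolves to on the PATH, and returns a non-zero status if none is found.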
+
+At this stage, ``configure`` also queries the chosen Python interpreter
+about QEMU's build dependencies. Note that the build process does *not*
+look for ``meson``, ``sphinx-build`` or ``avocado`` binaries in the PATH;
+likewise, there are no options such as ``--meson`` or ``--sphinx-build``.
+This avoids a potential mismatch, where Meson and Sphinx binaries on the
+PATH might operate in a different Python environment than the one chosen
+by the user during the build process. On the other hand, it introduces
+a potential source of confusion where the user installs a dependency but
+``configure`` is not able to find it. When this happens, the dependency
+was installed in the ``site-packages`` directory of another interpreter,
+or with the wrong ``pip`` program.
+
+If a package is available for the chosen interpreter, ``configure``
+prepares a small script that invokes it from the venv itself [#distlib]_.
+If not, ``configure`` can also optionally install dependencies in the
+virtual environment with ``pip``, either from wheels in ``python/wheels``
+or by downloading the package from PyPI. Downloading can be disabled with
+``--disable-download``; in any case, it only happens when a ``configure``
+option (currently, only ``--enable-docs``) is explicitly enabled but
+the dependencies are not present [#pip]_.
+
+.. [#distlib] The scripts are created based on the package's metadata,
+ specifically the ``console_scripts`` entry points. This is the
+ same mechanism that ``pip`` uses when installing a package.
+ Currently, in all cases it would be possible to use ``python -m``
+ instead of an entry point script, which makes this approach a
+ bit overkill. On the other hand, creating the scripts is
+ future proof and it makes the contents of the ``pyvenv/bin``
+ directory more informative. Portability is also not an issue,
+ because the Python Packaging Authority provides a package
+ ``distlib.scripts`` to perform this task.
+
+.. [#pip] ``pip`` might also be used when running ``make check-avocado``
+ if downloading is enabled, to ensure that Avocado is
+ available.
+
+The required versions of the packages are stored in a configuration file
+``pythondeps.toml``. The format is custom to QEMU, but it is documented
+at the top of the file itself and it should be easy to understand. The
+requirements should make it possible to use the version that is packaged
+by supported distros.
+
+When dependencies are downloaded, instead, ``configure`` uses a "known
+good" version that is also listed in ``pythondeps.toml``. In this
+scenario, ``pythondeps.toml`` behaves like the "lock file" used by
+``cargo``, ``poetry`` or other dependency management systems.
+
+
+Bundled Python packages
+-----------------------
+
+Python packages that are **mandatory** dependencies to build QEMU,
+but are not available in all supported distros, are bundled with the
+QEMU sources. Currently this includes Meson (outdated in CentOS 8
+and derivatives, Ubuntu 20.04 and 22.04, and openSUSE Leap) and tomli
+(absent in Ubuntu 20.04).
+
+If you need to update these, please do so by modifying and rerunning
+``python/scripts/vendor.py``. This script embeds the sha256 hash of
+package sources and checks it. The pypi.org web site provides an easy
+way to retrieve the sha256 hash of the sources.
Stage 2: Meson
==============
-The Meson build system is currently used to describe the build
-process for:
+The Meson build system describes the build and install process for:
1) executables, which include:
- - Tools - qemu-img, qemu-nbd, qga (guest agent), etc
+ - Tools - ``qemu-img``, ``qemu-nbd``, ``qemu-ga`` (guest agent), etc
- - System emulators - qemu-system-$ARCH
+ - System emulators - ``qemu-system-$ARCH``
- - Userspace emulators - qemu-$ARCH
+ - Userspace emulators - ``qemu-$ARCH``
- Unit tests
2) documentation
-3) ROMs, which can be either installed as binary blobs or compiled
+3) ROMs, whether provided as binary blobs in the QEMU distributions
+ or cross compiled under the direction of the configure script
4) other data files, such as icons or desktop files
@@ -221,26 +251,11 @@ Target-independent emulator sourcesets:
This includes error handling infrastructure, standard data structures,
platform portability wrapper functions, etc.
- Target-independent code lives in the ``common_ss``, ``softmmu_ss`` and
+ Target-independent code lives in the ``common_ss``, ``system_ss`` and
``user_ss`` sourcesets. ``common_ss`` is linked into all emulators,
- ``softmmu_ss`` only in system emulators, ``user_ss`` only in user-mode
+ ``system_ss`` only in system emulators, ``user_ss`` only in user-mode
emulators.
- Target-independent sourcesets must exercise particular care when using
- ``if_false`` rules. The ``if_false`` rule will be used correctly when linking
- emulator binaries; however, when *compiling* target-independent files
- into .o files, Meson may need to pick *both* the ``if_true`` and
- ``if_false`` sides to cater for targets that want either side. To
- achieve that, you can add a special rule using the ``CONFIG_ALL``
- symbol::
-
- # Some targets have CONFIG_ACPI, some don't, so this is not enough
- softmmu_ss.add(when: 'CONFIG_ACPI', if_true: files('acpi.c'),
- if_false: files('acpi-stub.c'))
-
- # This is required as well:
- softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('acpi-stub.c'))
-
Target-dependent emulator sourcesets:
In the target-dependent set lives CPU emulation, some device emulation and
much glue code. This sometimes also has to be compiled multiple times,
@@ -263,20 +278,20 @@ Target-dependent emulator sourcesets:
The sourceset is only used for system emulators.
Each subdirectory in ``target/`` instead should add one sourceset to each
- of the ``target_arch`` and ``target_softmmu_arch``, which are used respectively
+ of the ``target_arch`` and ``target_system_arch``, which are used respectively
for all emulators and for system emulators only. For example::
arm_ss = ss.source_set()
- arm_softmmu_ss = ss.source_set()
+ arm_system_ss = ss.source_set()
...
target_arch += {'arm': arm_ss}
- target_softmmu_arch += {'arm': arm_softmmu_ss}
+ target_system_arch += {'arm': arm_system_ss}
Module sourcesets:
There are two dictionaries for modules: ``modules`` is used for
target-independent modules and ``target_modules`` is used for
target-dependent modules. When modules are disabled the ``module``
- source sets are added to ``softmmu_ss`` and the ``target_modules``
+ source sets are added to ``system_ss`` and the ``target_modules``
source sets are added to ``specific_ss``.
Both dictionaries are nested. One dictionary is created per
@@ -335,6 +350,58 @@ new target, or enabling new devices or hardware for a particular
system/userspace emulation target
+Adding checks
+-------------
+
+Compiler checks can be as simple as the following::
+
+ config_host_data.set('HAVE_BTRFS_H', cc.has_header('linux/btrfs.h'))
+
+A more complex task such as adding a new dependency usually
+comprises the following tasks:
+
+ - Add a Meson build option to meson_options.txt.
+
+ - Add code to perform the actual feature check.
+
+ - Add code to include the feature status in ``config-host.h``.
+
+ - Add code to print out the feature status in the configure summary
+ upon completion.
+
+Taking the probe for SDL2_Image as an example, we have the following
+in ``meson_options.txt``::
+
+ option('sdl_image', type : 'feature', value : 'auto',
+ description: 'SDL Image support for icons')
+
+Unless the option was given a non-``auto`` value (on the configure
+command line), the detection code must be performed only if the
+dependency will be used::
+
+ sdl_image = not_found
+ if not get_option('sdl_image').auto() or have_system
+ sdl_image = dependency('SDL2_image', required: get_option('sdl_image'),
+ method: 'pkg-config')
+ endif
+
+This avoids warnings on static builds of user-mode emulators, for example.
+Most of the libraries used by system-mode emulators are not available for
+static linking.
+
+The other supporting code is generally simple::
+
+ # Create config-host.h (if applicable)
+ config_host_data.set('CONFIG_SDL_IMAGE', sdl_image.found())
+
+ # Summary
+ summary_info += {'SDL image support': sdl_image.found()}
+
+For the configure script to parse the new option, the
+``scripts/meson-buildoptions.sh`` file must be up-to-date; ``make
+update-buildoptions`` (or just ``make``) will take care of updating it.
+
+
Support scripts
---------------
@@ -362,23 +429,51 @@ This is needed to obey the --python= option passed to the configure
script, which may point to something other than the first python3
binary on the path.
+By the time Meson runs, Python dependencies are available in the virtual
+environment and should be invoked through the scripts that ``configure``
+places under ``pyvenv``. One way to do so is as follows, using Meson's
+``find_program`` function::
-Stage 3: makefiles
-==================
+ sphinx_build = find_program(
+ fs.parent(python.full_path()) / 'sphinx-build',
+ required: get_option('docs'))
+
+
+Stage 3: Make
+=============
-The use of GNU make is required with the QEMU build system.
+The next step in building QEMU is to invoke make. GNU Make is required
+to build QEMU, and may be installed as ``gmake`` on some hosts.
-The output of Meson is a build.ninja file, which is used with the Ninja
-build system. QEMU uses a different approach, where Makefile rules are
-synthesized from the build.ninja file. The main Makefile includes these
-rules and wraps them so that e.g. submodules are built before QEMU.
-The resulting build system is largely non-recursive in nature, in
-contrast to common practices seen with automake.
+The output of Meson is a ``build.ninja`` file, which is used with the
+Ninja build tool. However, QEMU's build comprises other components than
+just the emulators (namely firmware and the tests in ``tests/tcg``) which
+need different cross compilers. The QEMU Makefile wraps both Ninja and
+the smaller build systems for firmware and tests; it also takes care of
+running ``configure`` again when the script changes. Apart from invoking
+these sub-Makefiles, the resulting build is largely non-recursive.
+
+Tests, whether defined in ``meson.build`` or not, are also run by the
+Makefile with the traditional ``make check`` phony target, while benchmarks
+are run with ``make bench``. Meson test suites such as ``unit`` can be run
+with ``make check-unit``, and ``make check-tcg`` builds and runs "non-Meson"
+tests for all targets.
+
+If desired, it is also possible to use ``ninja`` and ``meson test``
+to build emulators and run tests defined in meson.build, respectively.
+The main difference is that ``make`` needs the ``-jN`` flag in order to
+enable parallel builds or tests.
+
+Useful make targets
+-------------------
+
+``help``
+ Print a help message for the most common build targets.
+
+``print-VAR``
+ Print the value of the variable VAR. Useful for debugging the build
+ system.
-Tests are also ran by the Makefile with the traditional ``make check``
-phony target, while benchmarks are run with ``make bench``. Meson test
-suites such as ``unit`` can be ran with ``make check-unit`` too. It is also
-possible to run tests defined in meson.build with ``meson test``.
Important files for the build system
====================================
@@ -393,8 +488,7 @@ number of dynamically created files listed later.
``Makefile``
The main entry point used when invoking make to build all the components
of QEMU. The default 'all' target will naturally result in the build of
- every component. Makefile takes care of recursively building submodules
- directly via a non-recursive set of rules.
+ every component.
``*/meson.build``
The meson.build file in the root directory is the main entry point for the
@@ -402,61 +496,94 @@ number of dynamically created files listed later.
executables. Build rules for various subdirectories are included in
other meson.build files spread throughout the QEMU source tree.
+``python/scripts/mkvenv.py``
+ A wrapper for the Python ``venv`` and ``distlib.scripts`` packages.
+ It handles creating the virtual environment, creating scripts in
+ ``pyvenv/bin``, and calling ``pip`` to install dependencies.
+
``tests/Makefile.include``
- Rules for external test harnesses. These include the TCG tests,
- ``qemu-iotests`` and the Avocado-based acceptance tests.
+ Rules for external test harnesses. These include the TCG tests
+ and the Avocado-based integration tests.
``tests/docker/Makefile.include``
- Rules for Docker tests. Like tests/Makefile, this file is included
- directly by the top level Makefile, anything defined in this file will
- influence the entire build system.
+ Rules for Docker tests. Like ``tests/Makefile.include``, this file is
+ included directly by the top level Makefile, so anything defined in this
+ file will influence the entire build system.
``tests/vm/Makefile.include``
- Rules for VM-based tests. Like tests/Makefile, this file is included
- directly by the top level Makefile, anything defined in this file will
- influence the entire build system.
+ Rules for VM-based tests. Like ``tests/Makefile.include``, this file is
+ included directly by the top level Makefile, so anything defined in this
+ file will influence the entire build system.
Dynamically created files
-------------------------
-The following files are generated dynamically by configure in order to
-control the behaviour of the statically defined makefiles. This avoids
-the need for QEMU makefiles to go through any pre-processing as seen
-with autotools, where Makefile.am generates Makefile.in which generates
-Makefile.
+The following files are generated at run-time in order to control the
+behaviour of the Makefiles. This avoids the need for QEMU makefiles to
+go through any pre-processing as seen with autotools, where configure
+generates ``Makefile`` from ``Makefile.in``.
Built by configure:
``config-host.mak``
When configure has determined the characteristics of the build host it
- will write a long list of variables to config-host.mak file. This
- provides the various install directories, compiler / linker flags and a
- variety of ``CONFIG_*`` variables related to optionally enabled features.
- This is imported by the top level Makefile and meson.build in order to
- tailor the build output.
+ will write the paths to various tools to this file, for use in ``Makefile``
+ and to a smaller extent ``meson.build``.
- config-host.mak is also used as a dependency checking mechanism. If make
+ ``config-host.mak`` is also used as a dependency checking mechanism. If make
sees that the modification timestamp on configure is newer than that on
- config-host.mak, then configure will be re-run.
+ ``config-host.mak``, then configure will be re-run.
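The timestamp check described above is implemented by make's own dependency tracking. Purely as an illustration, the same comparison can be sketched in shell. The function name ``needs_reconfigure`` is hypothetical, and the ``-nt`` test operator is assumed to be available (it is in bash, dash and ksh, although it is not required by POSIX).

```shell
# Hypothetical helper, not part of the QEMU Makefile: report whether
# configure would need to be re-run, based on modification timestamps.
#   $1: path to the configure script
#   $2: path to config-host.mak
needs_reconfigure() {
    # Succeeds (returns 0) when config-host.mak is missing or when
    # configure is newer than config-host.mak.
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}
```

A caller could then write ``needs_reconfigure ../configure config-host.mak && echo "configure must be re-run"``, with the paths adjusted to the actual source and build trees.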
+
+``config-meson.cross``
+
+ A Meson "cross file" (or native file) used to communicate the paths to
+ the toolchain and other configuration options.
+
+``config.status``
+
+ A small shell script that will invoke configure again with the same
+ environment variables that were set during the first run. It's used to
+ rerun configure after changes to the source code, but it can also be
+ inspected manually to check the contents of the environment.
+
+``Makefile.prereqs``
+
+ A set of Makefile dependencies that order the build and execution of
+ firmware and tests after the container images and emulators that they
+ need.
+
+``pc-bios/*/config.mak``, ``tests/tcg/config-host.mak``, ``tests/tcg/*/config-target.mak``
+
+ Configuration variables used to build the firmware and TCG tests,
+ including paths to cross compilation toolchains.
- The variables defined here are those which are applicable to all QEMU
- build outputs. Variables which are potentially different for each
- emulator target are defined by the next file...
+``pyvenv``
+ A Python virtual environment that is used for all Python code running
+ during the build. Using a virtual environment ensures that even code
+ that is run via ``sphinx-build``, ``meson`` etc. uses the same interpreter
+ and packages.
Built by Meson:
+``config-host.h``
+ Used by C code to determine the properties of the build environment
+ and the set of enabled features for the entire build.
+
``${TARGET-NAME}-config-devices.mak``
- TARGET-NAME is again the name of a system or userspace emulator. The
- config-devices.mak file is automatically generated by make using the
- scripts/make_device_config.sh program, feeding it the
- default-configs/$TARGET-NAME file as input.
+ TARGET-NAME is the name of a system emulator. The file is
+ generated by Meson using files under ``configs/devices`` as input.
+
+``${TARGET-NAME}-config-target.mak``
+ TARGET-NAME is the name of a system or usermode emulator. The file is
+ generated by Meson using files under ``configs/targets`` as input.
-``config-host.h``, ``$TARGET-NAME/config-target.h``, ``$TARGET-NAME/config-devices.h``
- These files are used by source code to determine what features
- are enabled. They are generated from the contents of the corresponding
- ``*.h`` files using the scripts/create_config program. This extracts
- relevant variables and formats them as C preprocessor macros.
+``$TARGET_NAME-config-target.h``, ``$TARGET_NAME-config-devices.h``
+ Used by C code to determine the properties and enabled
+ features for each target. They are generated from
+ the contents of the corresponding ``*.mak`` files using Meson's
+ ``configure_file()`` function; each target can include them using
+ the ``CONFIG_TARGET`` and ``CONFIG_DEVICES`` macros, respectively.
``build.ninja``
The build rules.
@@ -473,14 +600,3 @@ Built by Makefile:
meson.build. The rules are produced from Meson's JSON description of
tests (obtained with "meson introspect --tests") through the script
scripts/mtest2make.py.
-
-
-Useful make targets
--------------------
-
-``help``
- Print a help message for the most common build targets.
-
-``print-VAR``
- Print the value of the variable VAR. Useful for debugging the build
- system.
diff --git a/docs/devel/ci-definitions.rst b/docs/devel/ci-definitions.rst.inc
index 32e22ff468..6d5c6fd9f2 100644
--- a/docs/devel/ci-definitions.rst
+++ b/docs/devel/ci-definitions.rst.inc
@@ -59,7 +59,7 @@ to system testing [5]_. Note that, in some cases, system testing may require
interaction with third-party software, like operating system images, databases,
networks, and so on.
-On QEMU, system testing is represented by the 'check-acceptance' target from
+On QEMU, system testing is represented by the 'check-avocado' target from
'make'.
Flaky tests
diff --git a/docs/devel/ci-jobs.rst b/docs/devel/ci-jobs.rst
deleted file mode 100644
index 277975e4ad..0000000000
--- a/docs/devel/ci-jobs.rst
+++ /dev/null
@@ -1,51 +0,0 @@
-Custom CI/CD variables
-======================
-
-QEMU CI pipelines can be tuned by setting some CI environment variables.
-
-Set variable globally in the user's CI namespace
-------------------------------------------------
-
-Variables can be set globally in the user's CI namespace setting.
-
-For further information about how to set these variables, please refer to::
-
- https://docs.gitlab.com/ee/ci/variables/#add-a-cicd-variable-to-a-project
-
-Set variable manually when pushing a branch or tag to the user's repository
----------------------------------------------------------------------------
-
-Variables can be set manually when pushing a branch or tag, using
-git-push command line arguments.
-
-Example setting the QEMU_CI_EXAMPLE_VAR variable:
-
-.. code::
-
- git push -o ci.variable="QEMU_CI_EXAMPLE_VAR=value" myrepo mybranch
-
-For further information about how to set these variables, please refer to::
-
- https://docs.gitlab.com/ee/user/project/push_options.html#push-options-for-gitlab-cicd
-
-Here is a list of the most used variables:
-
-QEMU_CI_AVOCADO_TESTING
-~~~~~~~~~~~~~~~~~~~~~~~
-By default, tests using the Avocado framework are not run automatically in
-the pipelines (because multiple artifacts have to be downloaded, and if
-these artifacts are not already cached, downloading them make the jobs
-reach the timeout limit). Set this variable to have the tests using the
-Avocado framework run automatically.
-
-AARCH64_RUNNER_AVAILABLE
-~~~~~~~~~~~~~~~~~~~~~~~~
-If you've got access to an aarch64 host that can be used as a gitlab-CI
-runner, you can set this variable to enable the tests that require this
-kind of host. The runner should be tagged with "aarch64".
-
-S390X_RUNNER_AVAILABLE
-~~~~~~~~~~~~~~~~~~~~~~
-If you've got access to an IBM Z host that can be used as a gitlab-CI
-runner, you can set this variable to enable the tests that require this
-kind of host. The runner should be tagged with "s390x".
diff --git a/docs/devel/ci-jobs.rst.inc b/docs/devel/ci-jobs.rst.inc
new file mode 100644
index 0000000000..be06322279
--- /dev/null
+++ b/docs/devel/ci-jobs.rst.inc
@@ -0,0 +1,197 @@
+.. _ci_var:
+
+Custom CI/CD variables
+======================
+
+QEMU CI pipelines can be tuned by setting some CI environment variables.
+
+Set variable globally in the user's CI namespace
+------------------------------------------------
+
+Variables can be set globally in the user's CI namespace setting.
+
+For further information about how to set these variables, please refer to::
+
+ https://docs.gitlab.com/ee/ci/variables/#add-a-cicd-variable-to-a-project
+
+Set variable manually when pushing a branch or tag to the user's repository
+---------------------------------------------------------------------------
+
+Variables can be set manually when pushing a branch or tag, using
+git-push command line arguments.
+
+Example setting the QEMU_CI_EXAMPLE_VAR variable:
+
+.. code::
+
+ git push -o ci.variable="QEMU_CI_EXAMPLE_VAR=value" myrepo mybranch
+
+For further information about how to set these variables, please refer to::
+
+ https://docs.gitlab.com/ee/user/project/push_options.html#push-options-for-gitlab-cicd
+
+Setting aliases in your git config
+----------------------------------
+
+You can use aliases to make it easier to push branches with different
+CI configurations. For example define an alias for triggering CI:
+
+.. code::
+
+ git config --local alias.push-ci "push -o ci.variable=QEMU_CI=1"
+ git config --local alias.push-ci-now "push -o ci.variable=QEMU_CI=2"
+
+Which lets you run:
+
+.. code::
+
+ git push-ci
+
+to create the pipeline, or:
+
+.. code::
+
+ git push-ci-now
+
+to create and run the pipeline.
+
+
+Variable naming and grouping
+----------------------------
+
+The variables used by QEMU's CI configuration are grouped together
+in a handful of namespaces:
+
+ * QEMU_JOB_nnnn - variables to be defined in individual jobs
+ or templates, to influence the shared rules defined in the
+ .base_job_template.
+
+ * QEMU_CI_nnn - variables to be set by contributors in their
+ repository CI settings, or as git push variables, to influence
+ which jobs get run in a pipeline
+
+ * QEMU_CI_CONTAINER_TAG - the tag used to publish containers
+ in stage 1, for use by build jobs in stage 2. Defaults to
+ 'latest', but if running pipelines for different branches
+ concurrently, it should be overridden per pipeline.
+
+ * QEMU_CI_UPSTREAM - gitlab namespace that is considered to be
+ the 'upstream'. This defaults to 'qemu-project'. Contributors
+ may choose to override this if they are modifying rules in
+ base.yml and need to validate how they will operate when in
+ an upstream context, as opposed to their fork context.
+
+ * nnn - other misc variables not falling into the above
+ categories, or using different names for historical reasons
+ and not yet converted.
+
+Maintainer controlled job variables
+-----------------------------------
+
+The following variables may be set when defining a job in the
+CI configuration file.
+
+QEMU_JOB_CIRRUS
+~~~~~~~~~~~~~~~
+
+The job makes use of Cirrus CI infrastructure, requiring the
+configuration setup for cirrus-run to be present in the repository
+
+QEMU_JOB_OPTIONAL
+~~~~~~~~~~~~~~~~~
+
+The job is expected to be successful in general, but is not run
+by default due to the need to conserve limited CI resources. It is
+available to be started manually by the contributor in the CI
+pipelines UI.
+
+QEMU_JOB_ONLY_FORKS
+~~~~~~~~~~~~~~~~~~~
+
+The job results are only of interest to contributors prior to
+submitting code. They are not required as part of the gating
+CI pipeline.
+
+QEMU_JOB_SKIPPED
+~~~~~~~~~~~~~~~~
+
+The job is not reliably successful in general, so is not
+currently suitable to be run by default. Ideally this should
+be a temporary marker until the problems can be addressed, or
+the job permanently removed.
+
+QEMU_JOB_PUBLISH
+~~~~~~~~~~~~~~~~
+
+The job is for publishing content after a branch has been
+merged into the upstream default branch.
+
+QEMU_JOB_AVOCADO
+~~~~~~~~~~~~~~~~
+
+The job runs the Avocado integration test suite
+
+Contributor controlled runtime variables
+----------------------------------------
+
+The following variables may be set by contributors to control
+job execution
+
+QEMU_CI
+~~~~~~~
+
+By default, no pipelines will be created on contributor forks
+in order to preserve CI credits
+
+Set this variable to 1 to create the pipelines, but leave all
+the jobs to be manually started from the UI
+
+Set this variable to 2 to create the pipelines and run all
+the jobs immediately, as was the historical behaviour
+
+QEMU_CI_AVOCADO_TESTING
+~~~~~~~~~~~~~~~~~~~~~~~
+By default, tests using the Avocado framework are not run automatically in
+the pipelines (because multiple artifacts have to be downloaded, and if
+these artifacts are not already cached, downloading them makes the jobs
+reach the timeout limit). Set this variable to have the tests using the
+Avocado framework run automatically.
+
+Other misc variables
+--------------------
+
+These variables are primarily to control execution of jobs on
+private runners
+
+AARCH64_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to an aarch64 host that can be used as a gitlab-CI
+runner, you can set this variable to enable the tests that require this
+kind of host. The runner should be tagged with "aarch64".
+
+AARCH32_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to an armhf host or an aarch64 host that can run
+aarch32 EL0 code to be used as a gitlab-CI runner, you can set this
+variable to enable the tests that require this kind of host. The
+runner should be tagged with "aarch32".
+
+S390X_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to an IBM Z host that can be used as a gitlab-CI
+runner, you can set this variable to enable the tests that require this
+kind of host. The runner should be tagged with "s390x".
+
+CENTOS_STREAM_8_x86_64_RUNNER_AVAILABLE
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If you've got access to a CentOS Stream 8 x86_64 host that can be
+used as a gitlab-CI runner, you can set this variable to enable the
+tests that require this kind of host. The runner should be tagged with
+both "centos_stream_8" and "x86_64".
+
+CCACHE_DISABLE
+~~~~~~~~~~~~~~
+The jobs are configured to use "ccache" by default since this typically
+reduces compilation time, at the cost of increased storage. If the
+use of "ccache" is suspected to be hurting the overall job execution
+time, set the "CCACHE_DISABLE=1" env variable to disable it.
diff --git a/docs/devel/ci-runners.rst b/docs/devel/ci-runners.rst.inc
index 7817001fb2..7817001fb2 100644
--- a/docs/devel/ci-runners.rst
+++ b/docs/devel/ci-runners.rst.inc
diff --git a/docs/devel/ci.rst b/docs/devel/ci.rst
index 8d95247188..ed88a2010b 100644
--- a/docs/devel/ci.rst
+++ b/docs/devel/ci.rst
@@ -1,13 +1,14 @@
+.. _ci:
+
==
CI
==
-QEMU has configurations enabled for a number of different CI services.
-The most up to date information about them and their status can be
-found at::
-
- https://wiki.qemu.org/Testing/CI
+Most of QEMU's CI is run on GitLab's infrastructure although a number
+of other CI services are used for specialised purposes. The most up to
+date information about them and their status can be found on the
+`project wiki testing page <https://wiki.qemu.org/Testing/CI>`_.
-.. include:: ci-definitions.rst
-.. include:: ci-jobs.rst
-.. include:: ci-runners.rst
+.. include:: ci-definitions.rst.inc
+.. include:: ci-jobs.rst.inc
+.. include:: ci-runners.rst.inc
diff --git a/docs/devel/clocks.rst b/docs/devel/clocks.rst
index 675fbeb6ab..177ee1c90d 100644
--- a/docs/devel/clocks.rst
+++ b/docs/devel/clocks.rst
@@ -279,6 +279,10 @@ You can change the multiplier and divider of a clock at runtime,
so you can use this to model clock controller devices which
have guest-programmable frequency multipliers or dividers.
+Similarly to ``clock_set()``, ``clock_set_mul_div()`` returns ``true`` if
+the clock state was modified; that is, if the multiplier or the divider
+or both were changed by the call.
+
Note that ``clock_set_mul_div()`` does not automatically call
``clock_propagate()``. If you make a runtime change to the
multiplier or divider you must call clock_propagate() yourself.
@@ -502,7 +506,7 @@ This is typically used to migrate an input clock state. For example:
VMStateDescription my_device_vmstate = {
.name = "my_device",
- .fields = (VMStateField[]) {
+ .fields = (const VMStateField[]) {
[...], /* other migrated fields */
VMSTATE_CLOCK(clk, MyDeviceState),
VMSTATE_END_OF_LIST()
diff --git a/docs/devel/code-of-conduct.rst b/docs/devel/code-of-conduct.rst
index 195444d1b4..f734ed0317 100644
--- a/docs/devel/code-of-conduct.rst
+++ b/docs/devel/code-of-conduct.rst
@@ -1,3 +1,5 @@
+.. _code_of_conduct:
+
Code of Conduct
===============
diff --git a/docs/devel/decodetree.rst b/docs/devel/decodetree.rst
index 49ea50c2a7..e3392aa705 100644
--- a/docs/devel/decodetree.rst
+++ b/docs/devel/decodetree.rst
@@ -23,22 +23,42 @@ Fields
Syntax::
- field_def := '%' identifier ( unnamed_field )* ( !function=identifier )?
+ field_def := '%' identifier ( field )* ( !function=identifier )?
+ field := unnamed_field | named_field
unnamed_field := number ':' ( 's' ) number
+ named_field := identifier ':' ( 's' ) number
For *unnamed_field*, the first number is the least-significant bit position
of the field and the second number is the length of the field. If the 's' is
-present, the field is considered signed. If multiple ``unnamed_fields`` are
-present, they are concatenated. In this way one can define disjoint fields.
+present, the field is considered signed.
+
+A *named_field* refers to some other field in the instruction pattern
+or format. Regardless of the length of the other field where it is
+defined, it will be inserted into this field with the specified
+signedness and bit width.
+
+Field definitions that involve loops (i.e. where a field is defined
+directly or indirectly in terms of itself) are errors.
+
+A format can include fields that refer to named fields that are
+defined in the instruction pattern(s) that use the format.
+Conversely, an instruction pattern can include fields that refer to
+named fields that are defined in the format it uses. However you
+cannot currently do both at once (i.e. pattern P uses format F; F has
+a field A that refers to a named field B that is defined in P, and P
+has a field C that refers to a named field D that is defined in F).
+
+If multiple ``fields`` are present, they are concatenated.
+In this way one can define disjoint fields.
If ``!function`` is specified, the concatenated result is passed through the
named function, taking and returning an integral value.
-One may use ``!function`` with zero ``unnamed_fields``. This case is called
+One may use ``!function`` with zero ``fields``. This case is called
a *parameter*, and the named function is only passed the ``DisasContext``
and returns an integral value extracted from there.
-A field with no ``unnamed_fields`` and no ``!function`` is in error.
+A field with no ``fields`` and no ``!function`` is in error.
Field examples:
@@ -56,6 +76,9 @@ Field examples:
| %shimm8 5:s8 13:1 | expand_shimm8(sextract(i, 5, 8) << 1 | |
| !function=expand_shimm8 | extract(i, 13, 1)) |
+---------------------------+---------------------------------------------+
+| %sz_imm 10:2 sz:3 | expand_sz_imm(extract(i, 10, 2) << 3 | |
+| !function=expand_sz_imm | extract(a->sz, 0, 3)) |
++---------------------------+---------------------------------------------+
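
The ``%sz_imm`` row above can be read as the following C sketch. It is an illustration only, not the code the decodetree script emits: ``sketch_extract32``, ``sketch_args``, ``sketch_expand_sz_imm`` and ``sketch_decode_sz_imm`` are hypothetical names, the ``!function`` is reduced to an identity function, and the ``DisasContext`` argument that a real ``!function`` receives is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for an unsigned bitfield extract: returns 'length'
 * bits of 'value' starting at bit position 'start'. */
static uint32_t sketch_extract32(uint32_t value, int start, int length)
{
    assert(start >= 0 && length > 0 && length <= 32 - start);
    return (value >> start) & (~0U >> (32 - length));
}

/* Hypothetical argument set that already holds a decoded 'sz' field. */
typedef struct {
    uint32_t sz;
} sketch_args;

/* Hypothetical !function; a real one would transform the value. */
static uint32_t sketch_expand_sz_imm(uint32_t x)
{
    return x;
}

/* %sz_imm 10:2 sz:3 !function=expand_sz_imm */
static uint32_t sketch_decode_sz_imm(uint32_t insn, const sketch_args *a)
{
    uint32_t hi = sketch_extract32(insn, 10, 2);  /* unnamed field 10:2 */
    uint32_t lo = sketch_extract32(a->sz, 0, 3);  /* named field sz:3 */

    /* Concatenate: the earlier field lands in the more significant bits. */
    return sketch_expand_sz_imm(hi << 3 | lo);
}
```

Only the low 3 bits of ``a->sz`` are used, however wide ``sz`` was where it was defined, which matches the rule that a named field is inserted with the signedness and bit width given at the point of reference.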
Argument Sets
=============
diff --git a/docs/devel/docs.rst b/docs/devel/docs.rst
new file mode 100644
index 0000000000..a7768b5311
--- /dev/null
+++ b/docs/devel/docs.rst
@@ -0,0 +1,68 @@
+
+==================
+QEMU Documentation
+==================
+
+QEMU's documentation is written in reStructuredText format and
+built using the Sphinx documentation generator. We generate both
+the HTML manual and the manpages from the same documentation sources.
+
+hxtool and .hx files
+--------------------
+
+The documentation for QEMU command line options and Human Monitor Protocol
+(HMP) commands is written in files with the ``.hx`` suffix. These
+are processed in two ways:
+
+ * ``scripts/hxtool`` creates C header files from them, which are included
+ in QEMU to do things like handle the ``--help`` option output
+ * a Sphinx extension in ``docs/sphinx/hxtool.py`` generates rST output
+ to be included in the HTML or manpage documentation
+
+The syntax of these ``.hx`` files is simple. It is broadly an
+alternation of C code put into the C output and rST format text
+put into the documentation. A few special directives are recognised;
+these are all-caps and must be at the beginning of the line.
+
+``HXCOMM`` is the comment marker. The line, including any arbitrary
+text after the marker, is discarded and appears neither in the C output
+nor the documentation output.
+
+``SRST`` starts a reStructuredText section. Following lines
+are put into the documentation verbatim, and discarded from the C output.
+The alternative form ``SRST()`` is used to define a label which can be
+referenced from elsewhere in the rST documentation. The label will take
+the form ``<DOCNAME-HXFILE-LABEL>``, where ``DOCNAME`` is the name of the
+top level rST file, ``HXFILE`` is the filename of the .hx file without
+the ``.hx`` extension, and ``LABEL`` is the text provided within the
+``SRST()`` directive. For example,
+``<system/invocation-qemu-options-initrd>``.
+
+``ERST`` ends the documentation section started with ``SRST``,
+and switches back to a C code section.
+
+``DEFHEADING()`` defines a heading that should appear in both the
+``--help`` output and in the documentation. This directive should
+be in the C code block. If there is a string inside the brackets,
+this is the heading to use. If this string is empty, it produces
+a blank line in the ``--help`` output and is ignored for the rST
+output.
+
+``ARCHHEADING()`` is a variant of ``DEFHEADING()`` which produces
+the heading only if the specified guest architecture was compiled
+into QEMU. This should be avoided in new documentation.
+
+Within C code sections, you should check the comments at the top
+of the file to see what the expected usage is, because this
+varies between files. For instance in ``qemu-options.hx`` we use
+the ``DEF()`` macro to define each option and specify its ``--help``
+text, but in ``hmp-commands.hx`` the C code sections are elements
+of an array of structs of type ``HMPCommand`` which define the
+name, behaviour and help text for each monitor command.
+
+In the file ``qemu-options.hx``, do not try to explicitly define a
+reStructuredText label within a documentation section. This file
+is included into two separate Sphinx documents, and some
+versions of Sphinx will complain about the duplicate label
+that results. Use the ``SRST()`` directive documented above, to
+emit an unambiguous label.
diff --git a/docs/devel/fuzzing.rst b/docs/devel/fuzzing.rst
index 2749bb9bed..3bfcb33fc4 100644
--- a/docs/devel/fuzzing.rst
+++ b/docs/devel/fuzzing.rst
@@ -19,11 +19,6 @@ responsibility to ensure that state is reset between fuzzing-runs.
Building the fuzzers
--------------------
-*NOTE*: If possible, build a 32-bit binary. When forking, the 32-bit fuzzer is
-much faster, since the page-map has a smaller size. This is due to the fact that
-AddressSanitizer maps ~20TB of memory, as part of its detection. This results
-in a large page-map, and a much slower ``fork()``.
-
To build the fuzzers, install a recent version of clang:
Configure with (substitute the clang binaries with the version you installed).
Here, enable-sanitizers, is optional but it allows us to reliably detect bugs
@@ -182,10 +177,11 @@ The output should contain a complete list of matched MemoryRegions.
OSS-Fuzz
--------
-QEMU is continuously fuzzed on `OSS-Fuzz` __(https://github.com/google/oss-fuzz).
-By default, the OSS-Fuzz build will try to fuzz every fuzz-target. Since the
-generic-fuzz target requires additional information provided in environment
-variables, we pre-define some generic-fuzz configs in
+QEMU is continuously fuzzed on `OSS-Fuzz
+<https://github.com/google/oss-fuzz>`_. By default, the OSS-Fuzz build
+will try to fuzz every fuzz-target. Since the generic-fuzz target
+requires additional information provided in environment variables, we
+pre-define some generic-fuzz configs in
``tests/qtest/fuzz/generic_fuzz_configs.h``. Each config must specify:
- ``.name``: To identify the fuzzer config
@@ -286,8 +282,8 @@ select the fuzz target. Then, the qtest client is initialized. If the target
requires qos, qgraph is set up and the QOM/LIBQOS modules are initialized.
Then the QGraph is walked and the QEMU cmd_line is determined and saved.
-After this, the ``vl.c:qemu_main`` is called to set up the guest. There are
-target-specific hooks that can be called before and after qemu_main, for
+After this, the ``vl.c:main`` is called to set up the guest. There are
+target-specific hooks that can be called before and after main, for
additional setup(e.g. PCI setup, or VM snapshotting).
``LLVMFuzzerTestOneInput``: Uses qtest/qos functions to act based on the fuzz
@@ -295,10 +291,9 @@ input. It is also responsible for manually calling ``main_loop_wait`` to ensure
that bottom halves are executed and any cleanup required before the next input.
Since the same process is reused for many fuzzing runs, QEMU state needs to
-be reset at the end of each run. There are currently two implemented
-options for resetting state:
+be reset at the end of each run. For example, this can be done by rebooting the
+VM after each run.
-- Reboot the guest between runs.
- *Pros*: Straightforward and fast for simple fuzz targets.
- *Cons*: Depending on the device, does not reset all device state. If the
@@ -307,15 +302,3 @@ options for resetting state:
reboot.
- *Example target*: ``i440fx-qtest-reboot-fuzz``
-
-- Run each test case in a separate forked process and copy the coverage
- information back to the parent. This is fairly similar to AFL's "deferred"
- fork-server mode [3]
-
- - *Pros*: Relatively fast. Devices only need to be initialized once. No need to
- do slow reboots or vmloads.
-
- - *Cons*: Not officially supported by libfuzzer. Does not work well for
- devices that rely on dedicated threads.
-
- - *Example target*: ``virtio-net-fork-fuzz``
diff --git a/docs/devel/index-api.rst b/docs/devel/index-api.rst
new file mode 100644
index 0000000000..fe01b2b488
--- /dev/null
+++ b/docs/devel/index-api.rst
@@ -0,0 +1,18 @@
+Internal QEMU APIs
+------------------
+
+Details about QEMU's various internal APIs. Most of these are
+generated from in-code annotations to function prototypes.
+
+.. toctree::
+ :maxdepth: 2
+
+ bitops
+ loads-stores
+ memory
+ modules
+ pci
+ qom-api
+ qdev-api
+ ui
+ zoned-storage
diff --git a/docs/devel/index-build.rst b/docs/devel/index-build.rst
new file mode 100644
index 0000000000..90b406ca0e
--- /dev/null
+++ b/docs/devel/index-build.rst
@@ -0,0 +1,20 @@
+QEMU Build and Test System
+--------------------------
+
+Details about how QEMU's build system works and how it is integrated
+into our testing infrastructure. You will need to understand some of
+the basics if you are adding new files and targets to the build.
+
+.. toctree::
+ :maxdepth: 3
+
+ build-system
+ kconfig
+ docs
+ testing
+ acpi-bits
+ qtest
+ ci
+ qapi-code-gen
+ fuzzing
+ control-flow-integrity
diff --git a/docs/devel/index-internals.rst b/docs/devel/index-internals.rst
new file mode 100644
index 0000000000..5636e9cf1d
--- /dev/null
+++ b/docs/devel/index-internals.rst
@@ -0,0 +1,22 @@
+Internal Subsystem Information
+------------------------------
+
+Details about QEMU's various subsystems including how to add features to them.
+
+.. toctree::
+ :maxdepth: 2
+
+ qom
+ atomics
+ block-coroutine-wrapper
+ clocks
+ ebpf_rss
+ migration/index
+ multi-process
+ reset
+ s390-cpu-topology
+ s390-dasd-ipl
+ tracing
+ vfio-iommufd
+ writing-monitor-commands
+ virtio-backends
diff --git a/docs/devel/index-process.rst b/docs/devel/index-process.rst
new file mode 100644
index 0000000000..362f97ee30
--- /dev/null
+++ b/docs/devel/index-process.rst
@@ -0,0 +1,19 @@
+.. _development_process:
+
+QEMU Community Processes
+------------------------
+
+Notes about how to interact with the community and how and where to submit patches.
+
+.. toctree::
+ :maxdepth: 2
+
+ code-of-conduct
+ conflict-resolution
+ maintainers
+ style
+ submitting-a-patch
+ trivial-patches
+ stable-process
+ submitting-a-pull-request
+ secure-coding-practices
diff --git a/docs/devel/index-tcg.rst b/docs/devel/index-tcg.rst
new file mode 100644
index 0000000000..a992844e5c
--- /dev/null
+++ b/docs/devel/index-tcg.rst
@@ -0,0 +1,19 @@
+.. _tcg:
+
+TCG Emulation
+-------------
+
+Details about QEMU's Tiny Code Generator and the infrastructure
+associated with emulation. You do not need to worry about this if you
+are only implementing things for HW accelerated hypervisors.
+
+.. toctree::
+ :maxdepth: 2
+
+ tcg
+ tcg-ops
+ decodetree
+ multi-thread-tcg
+ tcg-icount
+ tcg-plugins
+ replay
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index f95df10b3e..abf60457c2 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -2,46 +2,35 @@
Developer Information
---------------------
-This section of the manual documents various parts of the internals of QEMU.
-You only need to read it if you are interested in reading or
+This section of the manual documents various parts of the internals of
+QEMU. You only need to read it if you are interested in reading or
modifying QEMU's source code.
+QEMU is a large and mature project with a number of complex subsystems
+that can be overwhelming to understand. The development documentation
+is not comprehensive but hopefully presents enough to get you started.
+If there are areas that are unclear please reach out either via the
+IRC channel or mailing list and hopefully we can improve the
+documentation for future developers.
+
+All developers will want to familiarise themselves with
+:ref:`development_process` and how the community interacts. Please pay
+particular attention to the :ref:`coding-style` and
+:ref:`submitting-a-patch` sections to avoid common pitfalls.
+
+If you wish to implement a new hardware model you will want to read
+through the :ref:`qom` documentation to understand how QEMU's object
+model works.
+
+Those wishing to enhance or add new CPU emulation capabilities will
+want to read our :ref:`tcg` documentation, especially the overview of
+the :ref:`tcg_internals`.
+
.. toctree::
- :maxdepth: 2
- :includehidden:
+ :maxdepth: 1
- code-of-conduct
- conflict-resolution
- build-system
- style
- kconfig
- testing
- fuzzing
- control-flow-integrity
- loads-stores
- memory
- migration
- atomics
- stable-process
- ci
- qtest
- decodetree
- secure-coding-practices
- tcg
- tcg-icount
- tracing
- multi-thread-tcg
- tcg-plugins
- bitops
- ui
- reset
- s390-dasd-ipl
- clocks
- qom
- modules
- block-coroutine-wrapper
- multi-process
- ebpf_rss
- vfio-migration
- qapi-code-gen
- writing-qmp-commands
+ index-process
+ index-build
+ index-api
+ index-internals
+ index-tcg
diff --git a/docs/devel/kconfig.rst b/docs/devel/kconfig.rst
index a1cdbec751..ccb9a46bd7 100644
--- a/docs/devel/kconfig.rst
+++ b/docs/devel/kconfig.rst
@@ -192,11 +192,15 @@ declares its dependencies in different ways:
no directive and are not used in the Makefile either; they only appear
as conditions for ``default y`` directives.
- QEMU currently has two device groups, ``PCI_DEVICES`` and
- ``TEST_DEVICES``. PCI devices usually have a ``default y if
+ QEMU currently has three device groups, ``PCI_DEVICES``, ``I2C_DEVICES``,
+ and ``TEST_DEVICES``. PCI devices usually have a ``default y if
PCI_DEVICES`` directive rather than just ``default y``. This lets
some boards (notably s390) easily support a subset of PCI devices,
for example only VFIO (passthrough) and virtio-pci devices.
+ ``I2C_DEVICES`` is similar to ``PCI_DEVICES``. It contains i2c devices
+ that users might reasonably want to plug in to an i2c bus on any
+ board (and not ones which are very board-specific or that need
+ to be wired up in a way that can't be done on the command line).
``TEST_DEVICES`` instead is used for devices that are rarely used on
production virtual machines, but provide useful hooks to test QEMU
or KVM.
@@ -270,7 +274,7 @@ or commenting out lines in the second group.
It is also possible to run QEMU's configure script with the
``--without-default-devices`` option. When this is done, everything defaults
-to ``n`` unless it is ``select``ed or explicitly switched on in the
+to ``n`` unless it is ``select``\ ed or explicitly switched on in the
``.mak`` files. In other words, ``default`` and ``imply`` directives
are disabled. When QEMU is built with this option, the user will probably
want to change some lines in the first group, for example like this::
@@ -278,9 +282,19 @@ want to change some lines in the first group, for example like this::
CONFIG_PCI_DEVICES=y
#CONFIG_TEST_DEVICES=n
-and/or pick a subset of the devices in those device groups. Right now
-there is no single place that lists all the optional devices for
-``CONFIG_PCI_DEVICES`` and ``CONFIG_TEST_DEVICES``. In the future,
+and/or pick a subset of the devices in those device groups. Without
+further modifications to ``configs/devices/``, a system emulator built
+without default devices might not do much more than start an empty
+machine, and even then only if ``--nodefaults`` is specified on the
+command line. Starting a VM *without* ``--nodefaults`` is allowed to
+fail, but should never abort. Failures in ``make check`` with
+``--without-default-devices`` are considered bugs in the test code:
+the tests should either use ``--nodefaults``, or be skipped
+if a necessary device is not present in the build. Such failures
+should not be worked around with ``select`` directives.
+
+Right now there is no single place that lists all the optional devices
+for ``CONFIG_PCI_DEVICES`` and ``CONFIG_TEST_DEVICES``. In the future,
we expect that ``.mak`` files will be automatically generated, so that
they will include all these symbols and some help text on what they do.
@@ -301,7 +315,7 @@ and also listed as follows in the top-level meson.build's host_kconfig
variable::
host_kconfig = \
- ('CONFIG_TPM' in config_host ? ['CONFIG_TPM=y'] : []) + \
- ('CONFIG_SPICE' in config_host ? ['CONFIG_SPICE=y'] : []) + \
+ (have_tpm ? ['CONFIG_TPM=y'] : []) + \
+ (host_os == 'linux' ? ['CONFIG_LINUX=y'] : []) + \
(have_ivshmem ? ['CONFIG_IVSHMEM=y'] : []) + \
...
diff --git a/docs/devel/loads-stores.rst b/docs/devel/loads-stores.rst
index 568274baec..ec627aa9c0 100644
--- a/docs/devel/loads-stores.rst
+++ b/docs/devel/loads-stores.rst
@@ -36,6 +36,7 @@ store: ``st{size}_{endian}_p(ptr, val)``
``size``
- ``b`` : 8 bits
- ``w`` : 16 bits
+ - ``24`` : 24 bits
- ``l`` : 32 bits
- ``q`` : 64 bits
@@ -62,21 +63,26 @@ which stores ``val`` to ``ptr`` as an ``{endian}`` order value
of size ``sz`` bytes.
-Regexes for git grep
+Regexes for git grep:
- ``\<ld[us]\?[bwlq]\(_[hbl]e\)\?_p\>``
- ``\<st[bwlq]\(_[hbl]e\)\?_p\>``
- - ``\<ldn_\([hbl]e\)?_p\>``
- - ``\<stn_\([hbl]e\)?_p\>``
+ - ``\<st24\(_[hbl]e\)\?_p\>``
+ - ``\<ldn_\([hbl]e\)\?_p\>``
+ - ``\<stn_\([hbl]e\)\?_p\>``
-``cpu_{ld,st}*_mmuidx_ra``
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+``cpu_{ld,st}*_mmu``
+~~~~~~~~~~~~~~~~~~~~
-These functions operate on a guest virtual address plus a context,
-known as a "mmu index" or ``mmuidx``, which controls how that virtual
-address is translated. The meaning of the indexes are target specific,
-but specifying a particular index might be necessary if, for instance,
-the helper requires an "always as non-privileged" access rather that
-the default access for the current state of the guest CPU.
+These functions operate on a guest virtual address, plus a context
+known as a "mmu index" which controls how that virtual address is
+translated, plus a ``MemOp`` which contains alignment requirements
+among other things. The ``MemOp`` and mmu index are combined into
+a single argument of type ``MemOpIdx``.
+
+The meaning of the indexes is target specific, but specifying a
+particular index might be necessary if, for instance, the helper
+requires an "always as non-privileged" access rather than the
+default access for the current state of the guest CPU.
These functions may cause a guest CPU exception to be taken
(e.g. for an alignment fault or MMU fault) which will result in
@@ -99,6 +105,35 @@ function, which is a return address into the generated code [#gpc]_.
Function names follow the pattern:
+load: ``cpu_ld{size}{end}_mmu(env, ptr, oi, retaddr)``
+
+store: ``cpu_st{size}{end}_mmu(env, ptr, val, oi, retaddr)``
+
+``size``
+ - ``b`` : 8 bits
+ - ``w`` : 16 bits
+ - ``l`` : 32 bits
+ - ``q`` : 64 bits
+
+``end``
+ - (empty) : for target endian, or 8 bit sizes
+ - ``_be`` : big endian
+ - ``_le`` : little endian
+
+Regexes for git grep:
+ - ``\<cpu_ld[bwlq]\(_[bl]e\)\?_mmu\>``
+ - ``\<cpu_st[bwlq]\(_[bl]e\)\?_mmu\>``
+
+
+``cpu_{ld,st}*_mmuidx_ra``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+These functions work like the ``cpu_{ld,st}_mmu`` functions except
+that the ``mmuidx`` parameter is not combined with a ``MemOp``,
+and therefore there is no required alignment supplied or enforced.
+
+Function names follow the pattern:
+
load: ``cpu_ld{sign}{size}{end}_mmuidx_ra(env, ptr, mmuidx, retaddr)``
store: ``cpu_st{size}{end}_mmuidx_ra(env, ptr, val, mmuidx, retaddr)``
@@ -120,8 +155,8 @@ store: ``cpu_st{size}{end}_mmuidx_ra(env, ptr, val, mmuidx, retaddr)``
- ``_le`` : little endian
Regexes for git grep:
- - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_mmuidx_ra\>``
- - ``\<cpu_st[bwlq](_[bl]e)\?_mmuidx_ra\>``
+ - ``\<cpu_ld[us]\?[bwlq]\(_[bl]e\)\?_mmuidx_ra\>``
+ - ``\<cpu_st[bwlq]\(_[bl]e\)\?_mmuidx_ra\>``
``cpu_{ld,st}*_data_ra``
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -132,7 +167,8 @@ of the guest CPU, as determined by ``cpu_mmu_index(env, false)``.
These are generally the preferred way to do accesses by guest
virtual address from helper functions, unless the access should
-be performed with a context other than the default.
+be performed with a context other than the default, or alignment
+should be enforced for the access.
Function names follow the pattern:
@@ -157,8 +193,8 @@ store: ``cpu_st{size}{end}_data_ra(env, ptr, val, ra)``
- ``_le`` : little endian
Regexes for git grep:
- - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_data_ra\>``
- - ``\<cpu_st[bwlq](_[bl]e)\?_data_ra\>``
+ - ``\<cpu_ld[us]\?[bwlq]\(_[bl]e\)\?_data_ra\>``
+ - ``\<cpu_st[bwlq]\(_[bl]e\)\?_data_ra\>``
``cpu_{ld,st}*_data``
~~~~~~~~~~~~~~~~~~~~~
@@ -195,9 +231,9 @@ store: ``cpu_st{size}{end}_data(env, ptr, val)``
- ``_be`` : big endian
- ``_le`` : little endian
-Regexes for git grep
- - ``\<cpu_ld[us]\?[bwlq](_[bl]e)\?_data\>``
- - ``\<cpu_st[bwlq](_[bl]e)\?_data\+\>``
+Regexes for git grep:
+ - ``\<cpu_ld[us]\?[bwlq]\(_[bl]e\)\?_data\>``
+ - ``\<cpu_st[bwlq]\(_[bl]e\)\?_data\>``
``cpu_ld*_code``
~~~~~~~~~~~~~~~~
@@ -241,7 +277,7 @@ called during the translator callback ``translate_insn``.
There is a set of functions ending in ``_swap`` which, if the parameter
is true, returns the value in the endianness that is the reverse of
-the guest native endianness, as determined by ``TARGET_WORDS_BIGENDIAN``.
+the guest native endianness, as determined by ``TARGET_BIG_ENDIAN``.
Function names follow the pattern:
@@ -260,34 +296,23 @@ swap: ``translator_ld{sign}{size}_swap(env, ptr, swap)``
- ``l`` : 32 bits
- ``q`` : 64 bits
-Regexes for git grep
+Regexes for git grep:
- ``\<translator_ld[us]\?[bwlq]\(_swap\)\?\>``
-``helper_*_{ld,st}*_mmu``
+``helper_{ld,st}*_mmu``
~~~~~~~~~~~~~~~~~~~~~~~~~
These functions are intended primarily to be called by the code
-generated by the TCG backend. They may also be called by target
-CPU helper function code. Like the ``cpu_{ld,st}_mmuidx_ra`` functions
-they perform accesses by guest virtual address, with a given ``mmuidx``.
-
-These functions specify an ``opindex`` parameter which encodes
-(among other things) the mmu index to use for the access. This parameter
-should be created by calling ``make_memop_idx()``.
+generated by the TCG backend. Like the ``cpu_{ld,st}_mmu`` functions
+they perform accesses by guest virtual address, with a given ``MemOpIdx``.
-The ``retaddr`` parameter should be the result of GETPC() called directly
-from the top level HELPER(foo) function (or 0 if no guest CPU state
-unwinding is required).
+They differ from ``cpu_{ld,st}_mmu`` in that they take the endianness
+of the operation only from the MemOpIdx, and loads extend the return
+value to the size of a host general register (``tcg_target_ulong``).
-**TODO** The names of these functions are a bit odd for historical
-reasons because they were originally expected to be called only from
-within generated code. We should rename them to bring them more in
-line with the other memory access functions. The explicit endianness
-is the only feature they have beyond ``*_mmuidx_ra``.
+load: ``helper_ld{sign}{size}_mmu(env, addr, opindex, retaddr)``
-load: ``helper_{endian}_ld{sign}{size}_mmu(env, addr, opindex, retaddr)``
-
-store: ``helper_{endian}_st{size}_mmu(env, addr, val, opindex, retaddr)``
+store: ``helper_st{size}_mmu(env, addr, val, opindex, retaddr)``
``sign``
- (empty) : for 32 or 64 bit sizes
@@ -300,14 +325,9 @@ store: ``helper_{endian}_st{size}_mmu(env, addr, val, opindex, retaddr)``
- ``l`` : 32 bits
- ``q`` : 64 bits
-``endian``
- - ``le`` : little endian
- - ``be`` : big endian
- - ``ret`` : target endianness
-
-Regexes for git grep
- - ``\<helper_\(le\|be\|ret\)_ld[us]\?[bwlq]_mmu\>``
- - ``\<helper_\(le\|be\|ret\)_st[bwlq]_mmu\>``
+Regexes for git grep:
+ - ``\<helper_ld[us]\?[bwlq]_mmu\>``
+ - ``\<helper_st[bwlq]_mmu\>``
``address_space_*``
~~~~~~~~~~~~~~~~~~~
@@ -362,7 +382,7 @@ succeeded using a MemTxResult return code.
The ``_{endian}`` suffix is omitted for byte accesses.
-Regexes for git grep
+Regexes for git grep:
- ``\<address_space_\(read\|write\|rw\)\>``
- ``\<address_space_ldu\?[bwql]\(_[lb]e\)\?\>``
- ``\<address_space_st[bwql]\(_[lb]e\)\?\>``
@@ -380,7 +400,7 @@ Note that portions of the write which attempt to write data to a
device will be silently ignored -- only real RAM and ROM will
be written to.
-Regexes for git grep
+Regexes for git grep:
- ``address_space_write_rom``
``{ld,st}*_phys``
@@ -418,7 +438,7 @@ device doing the access has no way to report such an error.
The ``_{endian}_`` infix is omitted for byte accesses.
-Regexes for git grep
+Regexes for git grep:
- ``\<ldu\?[bwlq]\(_[bl]e\)\?_phys\>``
- ``\<st[bwlq]\(_[bl]e\)\?_phys\>``
@@ -442,7 +462,7 @@ For new code they are better avoided:
``cpu_physical_memory_rw``
-Regexes for git grep
+Regexes for git grep:
- ``\<cpu_physical_memory_\(read\|write\|rw\)\>``
``cpu_memory_rw_debug``
@@ -477,7 +497,7 @@ make sure our existing code is doing things correctly.
``dma_memory_rw``
-Regexes for git grep
+Regexes for git grep:
- ``\<dma_memory_\(read\|write\|rw\)\>``
- ``\<ldu\?[bwlq]\(_[bl]e\)\?_dma\>``
- ``\<st[bwlq]\(_[bl]e\)\?_dma\>``
@@ -518,7 +538,7 @@ correct address space for that device.
The ``_{endian}_`` infix is omitted for byte accesses.
-Regexes for git grep
+Regexes for git grep:
- ``\<pci_dma_\(read\|write\|rw\)\>``
- ``\<ldu\?[bwlq]\(_[bl]e\)\?_pci_dma\>``
- ``\<st[bwlq]\(_[bl]e\)\?_pci_dma\>``
diff --git a/docs/devel/maintainers.rst b/docs/devel/maintainers.rst
new file mode 100644
index 0000000000..5c907d901c
--- /dev/null
+++ b/docs/devel/maintainers.rst
@@ -0,0 +1,107 @@
+.. _maintainers:
+
+The Role of Maintainers
+=======================
+
+Maintainers are a critical part of the project's contributor ecosystem.
+They come from a wide range of backgrounds from unpaid hobbyists
+working in their spare time to employees who work on the project as
+part of their job. Maintainer activities include:
+
+ - reviewing patches and suggesting changes
+ - collecting patches and preparing pull requests
+ - tending to the long term health of their area
+ - participating in other project activities
+
+They are also human and subject to the same pressures as everyone else
+including overload and burnout. Like everyone else they are subject
+to the project's :ref:`code_of_conduct` and should also be exemplary
+community collaborators.
+
+The MAINTAINERS file
+--------------------
+
+The `MAINTAINERS
+<https://gitlab.com/qemu-project/qemu/-/blob/master/MAINTAINERS>`__
+file contains the canonical list of who is a maintainer. The file
+is machine readable so an appropriately configured git (see
+:ref:`cc_the_relevant_maintainer`) can automatically Cc them on
+patches that touch their area of code.
+
+The file also describes the status of the area of code to give an idea
+of how actively that section is maintained.
+
+.. list-table:: Meaning of support status in MAINTAINERS
+ :widths: 25 75
+ :header-rows: 1
+
+ * - Status
+ - Meaning
+ * - Supported
+ - Someone is actually paid to look after this.
+ * - Maintained
+ - Someone actually looks after it.
+ * - Odd Fixes
+ - It has a maintainer but they don't have time to do
+ much other than throw the odd patch in.
+ * - Orphan
+ - No current maintainer.
+ * - Obsolete
+ - Old obsolete code, should use something else.
+
+Please bear in mind that even if someone is paid to support something
+it does not mean they are paid to support you. This is open source and
+the code comes with no warranty and the project makes no guarantees
+about dealing with bugs or feature requests.
+
+
+
+Becoming a reviewer
+-------------------
+
+Most maintainers start by becoming subsystem reviewers. While anyone
+is welcome to review code on the mailing list, getting added to the
+MAINTAINERS file with a line like::
+
+ R: Random Hacker <rhacker@example.com>
+
+marks you as a 'designated reviewer' - expected to provide regular
+spontaneous feedback. This will ensure that patches touching a given
+subsystem will automatically be CC'd to you.
+
+Becoming a maintainer
+---------------------
+
+Maintainers are volunteers who put themselves forward or have been
+asked by others to keep an eye on an area of code. They have generally
+demonstrated to the community, usually via contributions and code
+reviews, that they have a good understanding of the subsystem. They
+are also trusted to make a positive contribution to the project and
+work well with the other contributors.
+
+The process is simple - simply send a patch to the list that updates
+the ``MAINTAINERS`` file. Sometimes this is done as part of a larger
+series when a new sub-system is being added to the code base. This can
+also be done by a retiring maintainer who nominates their replacement
+after discussion with other contributors.
+
+Once the patch is reviewed and merged the only other step is to make
+sure your GPG key is signed.
+
+.. _maintainer_keys:
+
+Maintainer GPG Keys
+~~~~~~~~~~~~~~~~~~~
+
+GPG is used to sign pull requests so they can be identified as really
+coming from the maintainer. If your key is not already signed by
+members of the QEMU community, you should make arrangements to attend
+a `KeySigningParty <https://wiki.qemu.org/KeySigningParty>`__ (for
+example at KVM Forum) or make alternative arrangements to have your
+key signed by an attendee. Key signing requires meeting another
+community member **in person** [#]_ so please make appropriate
+arrangements.
+
+.. [#] In recent pandemic times we have had to exercise some
+ flexibility here. Maintainers still need to sign their pull
+ requests though.
diff --git a/docs/devel/memory.rst b/docs/devel/memory.rst
index 5dc8a12682..69c5e3f914 100644
--- a/docs/devel/memory.rst
+++ b/docs/devel/memory.rst
@@ -67,11 +67,15 @@ MemoryRegion):
You initialize a pure container with memory_region_init().
-- alias: a subsection of another region. Aliases allow a region to be
- split apart into discontiguous regions. Examples of uses are memory banks
- used when the guest address space is smaller than the amount of RAM
- addressed, or a memory controller that splits main memory to expose a "PCI
- hole". Aliases may point to any type of region, including other aliases,
+- alias: a subsection of another region. Aliases allow a region to be
+ split apart into discontiguous regions. Examples of uses are memory
+ banks used when the guest address space is smaller than the amount
+ of RAM addressed, or a memory controller that splits main memory to
+ expose a "PCI hole". You can also create aliases to avoid trying to
+ add the original region to multiple parents via
+ ``memory_region_add_subregion``.
+
+ Aliases may point to any type of region, including other aliases,
but an alias may not point back to itself, directly or indirectly.
You initialize these with memory_region_init_alias().
diff --git a/docs/devel/migration/CPR.rst b/docs/devel/migration/CPR.rst
new file mode 100644
index 0000000000..63c36470cf
--- /dev/null
+++ b/docs/devel/migration/CPR.rst
@@ -0,0 +1,147 @@
+CheckPoint and Restart (CPR)
+============================
+
+CPR is the umbrella name for a set of migration modes in which the
+VM is migrated to a new QEMU instance on the same host. It is
+intended for use when the goal is to update host software components
+that run the VM, such as QEMU or even the host kernel. At this time,
+cpr-reboot is the only available mode.
+
+Because QEMU is restarted on the same host, with access to the same
+local devices, CPR is allowed in certain cases where normal migration
+would be blocked. However, the user must not modify the contents of
+guest block devices between quitting old QEMU and starting new QEMU.
+
+CPR unconditionally stops VM execution before memory is saved, and
+thus does not depend on any form of dirty page tracking.
+
+cpr-reboot mode
+---------------
+
+In this mode, QEMU stops the VM, and writes VM state to the migration
+URI, which will typically be a file. After quitting QEMU, the user
+resumes by running QEMU with the ``-incoming`` option. Because the
+old and new QEMU instances are not active concurrently, the URI cannot
+be a type that streams data from one instance to the other.
+
+Guest RAM can be saved in place if backed by shared memory, or can be
+copied to a file. The former is more efficient and is therefore
+preferred.
+
+After state and memory are saved, the user may update userland host
+software before restarting QEMU and resuming the VM. Further, if
+the RAM is backed by persistent shared memory, such as a DAX device,
+then the user may reboot to a new host kernel before restarting QEMU.
+
+This mode supports VFIO devices provided the user first puts the
+guest in the suspended runstate, such as by issuing the
+``guest-suspend-ram`` command to the QEMU guest agent. The agent
+must be pre-installed in the guest, and the guest must support
+suspend to RAM. Beware that suspension can take a few seconds, so
+the user should poll to see the suspended state before proceeding
+with the CPR operation.
+
+Usage
+^^^^^
+
+It is recommended that guest RAM be backed with some type of shared
+memory, such as ``memory-backend-file,share=on``, and that the
+``x-ignore-shared`` capability be set. This combination allows memory
+to be saved in place. Otherwise, after QEMU stops the VM, all guest
+RAM is copied to the migration URI.
+
+Outgoing:
+ * Set the migration mode parameter to ``cpr-reboot``.
+ * Set the ``x-ignore-shared`` capability if desired.
+ * Issue the ``migrate`` command. It is recommended that the URI be a
+ ``file`` type, but one can use other types such as ``exec``,
+ provided the command captures all the data from the outgoing side,
+ and provides all the data to the incoming side.
+ * Quit when QEMU reaches the postmigrate state.
+
+Incoming:
+ * Start QEMU with the ``-incoming defer`` option.
+ * Set the migration mode parameter to ``cpr-reboot``.
+ * Set the ``x-ignore-shared`` capability if desired.
+ * Issue the ``migrate-incoming`` command.
+ * If the VM was running when the outgoing ``migrate`` command was
+ issued, then QEMU automatically resumes VM execution.
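+
+The outgoing and incoming steps above can also be written as QMP
+commands. The following is a minimal sketch that only builds the
+command sequence; the helper names are hypothetical, and connecting to
+the QMP socket, the capabilities handshake, and polling for the
+postmigrate state are assumed to be handled elsewhere.

```python
import json


def cpr_reboot_outgoing(uri, ignore_shared=True):
    """Return the QMP commands for the outgoing side, in order.

    Hypothetical helper: it only builds the JSON payloads, it does not
    talk to QEMU.
    """
    cmds = [
        {"execute": "migrate-set-parameters",
         "arguments": {"mode": "cpr-reboot"}},
    ]
    if ignore_shared:
        # Optional: lets shared guest RAM be saved in place.
        cmds.append({
            "execute": "migrate-set-capabilities",
            "arguments": {"capabilities": [
                {"capability": "x-ignore-shared", "state": True},
            ]},
        })
    cmds.append({"execute": "migrate", "arguments": {"uri": uri}})
    return cmds


def cpr_reboot_incoming(uri, ignore_shared=True):
    """Return the QMP commands for a QEMU started with -incoming defer."""
    # Same mode and capability setup as the outgoing side, then
    # migrate-incoming instead of migrate.
    cmds = cpr_reboot_outgoing(uri, ignore_shared)[:-1]
    cmds.append({"execute": "migrate-incoming", "arguments": {"uri": uri}})
    return cmds


if __name__ == "__main__":
    for cmd in cpr_reboot_outgoing("file:vm.state"):
        print(json.dumps(cmd))
```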
+
+Example 1
+^^^^^^^^^
+::
+
+ # qemu-kvm -monitor stdio
+ -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/dax0.0,align=2M,share=on -m 4G
+ ...
+
+ (qemu) info status
+ VM status: running
+ (qemu) migrate_set_parameter mode cpr-reboot
+ (qemu) migrate_set_capability x-ignore-shared on
+ (qemu) migrate -d file:vm.state
+ (qemu) info status
+ VM status: paused (postmigrate)
+ (qemu) quit
+
+ ### optionally update kernel and reboot
+ # systemctl kexec
+ kexec_core: Starting new kernel
+ ...
+
+ # qemu-kvm ... -incoming defer
+ (qemu) info status
+ VM status: paused (inmigrate)
+ (qemu) migrate_set_parameter mode cpr-reboot
+ (qemu) migrate_set_capability x-ignore-shared on
+ (qemu) migrate_incoming file:vm.state
+ (qemu) info status
+ VM status: running
+
+Example 2: VFIO
+^^^^^^^^^^^^^^^
+::
+
+ # qemu-kvm -monitor stdio
+ -object memory-backend-file,id=ram0,size=4G,mem-path=/dev/dax0.0,align=2M,share=on -m 4G
+ -device vfio-pci, ...
+ -chardev socket,id=qga0,path=qga.sock,server=on,wait=off
+ -device virtserialport,chardev=qga0,name=org.qemu.guest_agent.0
+ ...
+
+ (qemu) info status
+ VM status: running
+
+ # echo '{"execute":"guest-suspend-ram"}' | ncat --send-only -U qga.sock
+
+ (qemu) info status
+ VM status: paused (suspended)
+ (qemu) migrate_set_parameter mode cpr-reboot
+ (qemu) migrate_set_capability x-ignore-shared on
+ (qemu) migrate -d file:vm.state
+ (qemu) info status
+ VM status: paused (postmigrate)
+ (qemu) quit
+
+ ### optionally update kernel and reboot
+ # systemctl kexec
+ kexec_core: Starting new kernel
+ ...
+
+ # qemu-kvm ... -incoming defer
+ (qemu) info status
+ VM status: paused (inmigrate)
+ (qemu) migrate_set_parameter mode cpr-reboot
+ (qemu) migrate_set_capability x-ignore-shared on
+ (qemu) migrate_incoming file:vm.state
+ (qemu) info status
+ VM status: paused (suspended)
+ (qemu) system_wakeup
+ (qemu) info status
+ VM status: running
+
+Caveats
+^^^^^^^
+
+cpr-reboot mode may not be used with postcopy, background-snapshot,
+or COLO.
diff --git a/docs/devel/migration/best-practices.rst b/docs/devel/migration/best-practices.rst
new file mode 100644
index 0000000000..d7c34a3014
--- /dev/null
+++ b/docs/devel/migration/best-practices.rst
@@ -0,0 +1,48 @@
+==============
+Best practices
+==============
+
+Debugging
+=========
+
+The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``.
+
+Example usage:
+
+.. code-block:: shell
+
+ $ qemu-system-x86_64 -display none -monitor stdio
+ (qemu) migrate "exec:cat > mig"
+ (qemu) q
+ $ ./scripts/analyze-migration.py -f mig
+ {
+ "ram (3)": {
+ "section sizes": {
+ "pc.ram": "0x0000000008000000",
+ ...
+
+See also ``analyze-migration.py -h`` help for more options.
+
+Firmware
+========
+
+Migration migrates the copies of RAM and ROM, and thus when running
+on the destination it includes the firmware from the source. Even after
+resetting a VM, the old firmware is used. Only once QEMU has been restarted
+is the new firmware in use.
+
+- Changes in firmware size can cause changes in the required RAMBlock size
+ to hold the firmware and thus migration can fail. In practice it's best
+ to pad firmware images to convenient powers of 2 with plenty of space
+ for growth.
+
+- Care should be taken with device emulation code so that newer
+ emulation code can work with older firmware to allow forward migration.
+
+- Care should be taken with newer firmware so that backward migration
+ to older systems with older device emulation code will work.
+
+In some cases it may be best to tie specific firmware versions to specific
+versioned machine types to cut down on the combinations that will need
+support. This is also useful when newer versions of firmware outgrow
+the padding.
diff --git a/docs/devel/migration/compatibility.rst b/docs/devel/migration/compatibility.rst
new file mode 100644
index 0000000000..5a5417ef06
--- /dev/null
+++ b/docs/devel/migration/compatibility.rst
@@ -0,0 +1,517 @@
+Backwards compatibility
+=======================
+
+How backwards compatibility works
+---------------------------------
+
+When we do migration, we have two QEMU processes: the source and the
+target. There are two cases: either they are the same version, or
+they are different versions. The easy case is when they are the same
+version. The difficult one is when they are different versions.
+
+There are two things that are different, but they have very similar
+names and sometimes get confused:
+
+- QEMU version
+- machine type version
+
+Let's start with a practical example. We start with:
+
+- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
+- qemu-system-x86_64 (v5.1), from now on qemu-5.1.
+
+Related to this are the "latest" machine types defined on each of
+them:
+
+- pc-q35-5.2 (newest one in qemu-5.2), from now on pc-5.2
+- pc-q35-5.1 (newest one in qemu-5.1), from now on pc-5.1
+
+First of all, migration is only supposed to work if you use the same
+machine type in both source and destination. The QEMU hardware
+configuration also needs to be the same on source and destination.
+Most aspects of the backend configuration can be changed at will,
+except for a few cases where the backend features influence frontend
+device feature exposure. But that is not relevant for this section.
+
+I am going to list the combinations that we can have. Let's start
+with the trivial ones, where QEMU is the same on source and
+destination:
+
+1 - qemu-5.2 -M pc-5.2 -> migrates to -> qemu-5.2 -M pc-5.2
+
+ This is the latest QEMU with the latest machine type.
+ This has to work, and if it doesn't work it is a bug.
+
+2 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1
+
+ Exactly the same case as the previous one, but for 5.1.
+ Nothing to see here either.
+
+These are the easiest ones; we will not talk more about them in this
+section.
+
+Now we start with the more interesting cases. Consider the case where
+we have the same QEMU version on both sides (qemu-5.2), but we are not
+using the latest machine type for that version (pc-5.2) but one of an
+older QEMU version, in this case pc-5.1.
+
+3 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1
+
+ It needs to use the definition of pc-5.1 and the devices as they
+ were configured on 5.1, but this should be easy in the sense that
+ both sides are the same QEMU and both sides have exactly the same
+ idea of what the pc-5.1 machine is.
+
+4 - qemu-5.1 -M pc-5.2 -> migrates to -> qemu-5.1 -M pc-5.2
+
+ This combination is not possible as qemu-5.1 doesn't understand the
+ pc-5.2 machine type. So there is nothing to worry about here.
+
+Now come the interesting ones, when both QEMU processes are
+different. Notice also that the machine type needs to be pc-5.1,
+because we have the limitation that qemu-5.1 doesn't know pc-5.2. So
+the possible cases are:
+
+5 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1
+
+ This migration is known as newer to older. When we are developing
+ 5.2, we need to take care not to break migration to qemu-5.1.
+ Notice that we can't make updates to qemu-5.1 to understand
+ whatever qemu-5.2 decides to change, so it is up to the qemu-5.2
+ side to make the relevant changes.
+
+6 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1
+
+ This migration is known as older to newer. We need to make sure
+ that we are able to receive migrations from qemu-5.1. The problem is
+ similar to the previous one.
+
+If qemu-5.1 and qemu-5.2 were the same, there would not be any
+compatibility problems. But the reason that we create qemu-5.2 is to
+get new features, devices, defaults, etc.
+
+If we get a device that has a new feature, or change a default value,
+we have a problem when we try to migrate between different QEMU
+versions.
+
+So we need a way to tell qemu-5.2 that when we are using machine type
+pc-5.1, it needs to **not** use the feature, to be able to migrate to
+real qemu-5.1.
+
+The equivalent applies when migrating from qemu-5.1 to qemu-5.2:
+qemu-5.2 has to expect that it is not going to get data for the new
+feature, because qemu-5.1 doesn't know about it.
+
+How do we tell QEMU about these device feature changes? In the
+hw/core/machine.c:hw_compat_X_Y arrays.
+
+If we change a default value, we need to put back the old value in
+that array. The device, during initialization, needs to look at
+that array to see what value it needs to use for that feature. And
+what are we going to put in that array? The value of a property.
+
+To create a property for a device, we need to use one of the
+DEFINE_PROP_*() macros. See include/hw/qdev-properties.h to find the
+macros that exist. With it, we set the default value for that
+property, and that is what it is going to get in the latest released
+version. But if we want a different value for a previous version, we
+can change that in the hw_compat_X_Y arrays.
+
+hw_compat_X_Y is an array of records that have the format:
+
+- name_device
+- name_property
+- value
+
+Let's see a practical example.
+
+In qemu-5.2 virtio-blk-device got multi queue support. This is a
+change that is not backward compatible. In qemu-5.1 it has one
+queue. In qemu-5.2 it has the same number of queues as the number of
+cpus in the system.
+
+When we are doing migration, if we migrate from a device that has 4
+queues to a device that has only one queue, we don't know where to
+put the extra information for the other 3 queues, and we fail
+migration.
+
+A similar problem occurs when we migrate from qemu-5.1, which has only
+one queue, to qemu-5.2: we only send information for one queue, but
+the destination has 4, so 3 queues are not properly initialized and
+anything can happen.
+
+So, how can we address this problem? Easy, just convince qemu-5.2
+that when it is running pc-5.1, it needs to set the number of queues
+for virtio-blk-devices to 1.
+
+That way we fix cases 5 and 6.
+
+5 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.1 -M pc-5.1
+
+ qemu-5.2 -M pc-5.1 sets number of queues to be 1.
+ qemu-5.1 -M pc-5.1 expects number of queues to be 1.
+
+ correct. migration works.
+
+6 - qemu-5.1 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1
+
+ qemu-5.1 -M pc-5.1 sets number of queues to be 1.
+ qemu-5.2 -M pc-5.1 expects number of queues to be 1.
+
+ correct. migration works.
+
+And now the other interesting case, case 3. In this case we have:
+
+3 - qemu-5.2 -M pc-5.1 -> migrates to -> qemu-5.2 -M pc-5.1
+
+ Here we have the same QEMU on both sides. So it doesn't matter a
+ lot if we have set the number of queues to 1 or not, because
+ they are the same.
+
+ WRONG!
+
+ Think what happens if we do one of these double migrations:
+
+ A -> migrates -> B -> migrates -> C
+
+ where:
+
+ A: qemu-5.1 -M pc-5.1
+ B: qemu-5.2 -M pc-5.1
+ C: qemu-5.2 -M pc-5.1
+
+ migration A -> B is case 6, so number of queues needs to be 1.
+
+ migration B -> C is case 3, so we don't care. But actually we
+ do care, because we haven't started the guest in qemu-5.2; it was
+ migrated from qemu-5.1. So to be on the safe side, we need to
+ always use number of queues 1 when we are using pc-5.1.
+
+Now, how was this done in reality? The following commit shows how it
+was done::
+
+ commit 9445e1e15e66c19e42bea942ba810db28052cd05
+ Author: Stefan Hajnoczi <stefanha@redhat.com>
+ Date: Tue Aug 18 15:33:47 2020 +0100
+
+ virtio-blk-pci: default num_queues to -smp N
+
+The relevant parts for migration are::
+
+ @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
+ #endif
+ DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
+ true),
+ - DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
+ + DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
+ + VIRTIO_BLK_AUTO_NUM_QUEUES),
+ DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),
+
+It changes the default value of num_queues. But it fixes it for old
+machine types to have the right value::
+
+ @@ -31,6 +31,7 @@
+ GlobalProperty hw_compat_5_1[] = {
+ ...
+ + { "virtio-blk-device", "num-queues", "1"},
+ ...
+ };
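+
+The way a default value and a compat entry combine can be sketched
+with a toy model. This is not QEMU code: the tables and the
+``effective_value()`` helper are hypothetical, ``"auto"`` stands in
+for ``VIRTIO_BLK_AUTO_NUM_QUEUES``, and only the two versions from the
+example are modelled.

```python
# Toy model of property defaults and hw_compat arrays (assumption: this
# only mirrors the num-queues example above, it is not QEMU's code).
MACHINE_VERSIONS = ["5.1", "5.2"]  # oldest to newest

# Default value of each property as built into each QEMU release.
DEVICE_DEFAULTS = {
    "5.1": {("virtio-blk-device", "num-queues"): "1"},
    "5.2": {("virtio-blk-device", "num-queues"): "auto"},
}

# hw_compat_X_Y: values to restore when running machine type X.Y or older.
HW_COMPAT = {
    "5.1": [("virtio-blk-device", "num-queues", "1")],
}


def effective_value(qemu, machine, device, prop):
    """Value a property ends up with for a QEMU release and machine type."""
    q = MACHINE_VERSIONS.index(qemu)
    m = MACHINE_VERSIONS.index(machine)
    if m > q:
        raise ValueError(f"qemu-{qemu} does not know machine type pc-{machine}")
    value = DEVICE_DEFAULTS[qemu][(device, prop)]
    # Apply compat arrays from the newest one down to the machine's own,
    # so that entries for older machine types are applied last and win.
    for version in reversed(MACHINE_VERSIONS[m:q]):
        for dev, name, old in HW_COMPAT.get(version, []):
            if (dev, name) == (device, prop):
                value = old
    return value


if __name__ == "__main__":
    for qemu, machine in [("5.1", "5.1"), ("5.2", "5.1"), ("5.2", "5.2")]:
        value = effective_value(qemu, machine, "virtio-blk-device", "num-queues")
        print(f"qemu-{qemu} -M pc-{machine}: num-queues={value}")
```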
+
+A device with different features on both sides
+----------------------------------------------
+
+Let's assume that we are using the same QEMU binary on both sides,
+just to make things easier. But we have a device that has
+different features on both sides of the migration. That can be
+because the devices are different, because the kernel drivers of the
+two devices have different features, or for any other reason.
+
+How can we get this to work with migration? The way to do that is
+"theoretically" easy. You take the features that the device has on
+the source of the migration and the features that the device has on
+the target of the migration, you compute the intersection of the
+features of both sides, and that is the way that you should launch
+QEMU.
+
+Notice that this is not completely related to QEMU. The most
+important thing here is that this should be handled by the managing
+application that launches QEMU. If QEMU is configured correctly, the
+migration will succeed.
+
+That said, actually doing it is complicated. Almost all devices are
+bad at being able to be launched with only some features enabled.
+With one big exception: cpus.
+
+You can read the documentation for QEMU x86 cpu models here:
+
+https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html
+
+Notice that, when it talks about migration, it recommends choosing
+the newest cpu model that is supported by all cpus.
+
+Let's say that we have:
+
+Host A:
+
+Device X has the feature Y
+
+Host B:
+
+Device X does not have the feature Y
+
+If we try to migrate without any care from host A to host B, it will
+fail because when migration tries to load the feature Y on
+destination, it will find that the hardware is not there.
+
+Doing this would be the equivalent of doing the following with cpus:
+
+Host A:
+
+$ qemu-system-x86_64 -cpu host
+
+Host B:
+
+$ qemu-system-x86_64 -cpu host
+
+When both hosts have different cpu features this is guaranteed to
+fail. Especially if Host B has fewer features than host A. If host A
+has fewer features than host B, sometimes it works. The important word
+in the last sentence is "sometimes".
+
+So, forgetting about cpu models and continuing with the -cpu host
+example, let's say that the cpus of Host A and B differ in the
+following features:
+
+Features:   'pcid'  'stibp'  'taa-no'
+Host A:        X       X
+Host B:                          X
+
+If we want to migrate between them, the way to configure the cpu of
+both QEMUs is:
+
+Host A:
+
+$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off
+
+Host B:
+
+$ qemu-system-x86_64 -cpu host,taa-no=off
+
+And you would be able to migrate between them. It is the responsibility
+of the management application or of the user to make sure that the
+configuration is correct. QEMU doesn't know how to look at these kinds
+of features in general.
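+
+The intersection rule can be sketched as follows. This is only an
+illustration of what a management application might compute: the
+``cpu_option()`` helper is hypothetical, and the feature sets contain
+just the three features from the example, not everything a real host
+reports.

```python
def cpu_option(own, other):
    """Build a -cpu argument that hides features the other host lacks.

    Hypothetical helper: `own` and `other` are sets of feature names
    reported by the local and the remote host respectively.
    """
    # Features present locally but missing remotely must be turned off
    # so that both guests see only the intersection.
    extra = sorted(set(own) - set(other))
    return ",".join(["host"] + [f"{feat}=off" for feat in extra])


if __name__ == "__main__":
    host_a = {"pcid", "stibp"}
    host_b = {"taa-no"}
    print("Host A: qemu-system-x86_64 -cpu", cpu_option(host_a, host_b))
    print("Host B: qemu-system-x86_64 -cpu", cpu_option(host_b, host_a))
```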
+
+Notice that we don't recommend using -cpu host for migration. It is
+used in this example because it makes the example simpler.
+
+Other devices have worse control over individual features. If they
+want to be able to migrate between hosts that show different features,
+the device needs a way to configure which ones it is going to use.
+
+In this section we have considered that we are using the same QEMU
+binary on both sides of the migration. If we use different QEMU
+versions, then we need to take into account all the other
+differences and the examples become even more complicated.
+
+How to mitigate when we have a backward compatibility error
+-----------------------------------------------------------
+
+We break migration for old machine types continuously during
+development. But as soon as we find that there is a problem, we fix
+it. The problem is what happens when we detect, after we have done a
+release, that something has gone wrong.
+
+Let's see how it worked with one example.
+
+After the release of qemu-8.0 we found a problem when doing migration
+of the machine type pc-7.2.
+
+- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2
+
+ This migration works
+
+- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2
+
+ This migration works
+
+- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2
+
+ This migration fails
+
+- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2
+
+ This migration fails
+
+So clearly something fails when migrating between qemu-7.2 and
+qemu-8.0 with machine type pc-7.2. The error messages and git bisect
+pointed to this commit.
+
+In qemu-8.0 we got this commit::
+
+ commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
+ Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
+ Date: Thu Mar 2 13:37:02 2023 +0000
+
+ hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register
+
+
+The relevant bits of the commit for our example are these::
+
+ --- a/hw/pci/pcie_aer.c
+ +++ b/hw/pci/pcie_aer.c
+ @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,
+
+ pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
+ PCI_ERR_UNC_SUPPORTED);
+ + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
+ + PCI_ERR_UNC_MASK_DEFAULT);
+ + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
+ + PCI_ERR_UNC_SUPPORTED);
+
+ pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
+ PCI_ERR_UNC_SEVERITY_DEFAULT);
+
+The patch changes how we configure PCI space for AER. But QEMU fails
+when the PCI space configuration is different between source and
+destination.
+
+The following commit shows how this got fixed::
+
+ commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
+ Author: Leonardo Bras <leobras@redhat.com>
+ Date: Tue May 2 21:27:02 2023 -0300
+
+ hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0
+
+ [...]
+
+The relevant parts of the fix in QEMU are as follows:
+
+First, we create a new property for the device to be able to configure
+the old behaviour or the new behaviour::
+
+ diff --git a/hw/pci/pci.c b/hw/pci/pci.c
+ index 8a87ccc8b0..5153ad63d6 100644
+ --- a/hw/pci/pci.c
+ +++ b/hw/pci/pci.c
+ @@ -79,6 +79,8 @@ static Property pci_props[] = {
+ DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
+ failover_pair_id),
+ DEFINE_PROP_UINT32("acpi-index", PCIDevice, acpi_index, 0),
+ + DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
+ + QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
+ DEFINE_PROP_END_OF_LIST()
+ };
+
+Notice that we enable the feature for new machine types.
+
+Now we see how the fix is done. This is going to depend on what kind
+of breakage happens, but in this case it is quite simple::
+
+ diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
+ index 103667c368..374d593ead 100644
+ --- a/hw/pci/pcie_aer.c
+ +++ b/hw/pci/pcie_aer.c
+ @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
+ uint16_t offset,
+
+ pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
+ PCI_ERR_UNC_SUPPORTED);
+ - pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
+ - PCI_ERR_UNC_MASK_DEFAULT);
+ - pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
+ - PCI_ERR_UNC_SUPPORTED);
+ +
+ + if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
+ + pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
+ + PCI_ERR_UNC_MASK_DEFAULT);
+ + pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
+ + PCI_ERR_UNC_SUPPORTED);
+ + }
+
+ pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
+ PCI_ERR_UNC_SEVERITY_DEFAULT);
+
+I.e., if the property bit is enabled, we configure it as we did for
+qemu-8.0. If the property bit is not set, we configure it as it was in 7.2.
+
+And now, all that is missing is disabling the feature for old
+machine types::
+
+ diff --git a/hw/core/machine.c b/hw/core/machine.c
+ index 47a34841a5..07f763eb2e 100644
+ --- a/hw/core/machine.c
+ +++ b/hw/core/machine.c
+ @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
+ { "e1000e", "migrate-timadj", "off" },
+ { "virtio-mem", "x-early-migration", "false" },
+ { "migration", "x-preempt-pre-7-2", "true" },
+ + { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
+ };
+ const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
+
+And now, when qemu-8.0.1 is released with this fix, all combinations
+are going to work as expected.
+
+- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
+- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)
+- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2 (works)
+- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 (works)
+
+So normality has been restored and everything is ok, no?
+
+Not really, now our matrix is much bigger. We started with the easy
+cases, migration from the same version to the same version always
+works:
+
+- $ qemu-7.2 -M pc-7.2 -> qemu-7.2 -M pc-7.2
+- $ qemu-8.0 -M pc-7.2 -> qemu-8.0 -M pc-7.2
+- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
+
+Now the interesting ones, when the QEMU process versions are
+different. For the first set, they fail and we can do nothing: both
+versions are released and we can't change anything.
+
+- $ qemu-7.2 -M pc-7.2 -> qemu-8.0 -M pc-7.2
+- $ qemu-8.0 -M pc-7.2 -> qemu-7.2 -M pc-7.2
+
+These two are the ones that work. The whole point of making the
+change in qemu-8.0.1 release was to fix this issue:
+
+- $ qemu-7.2 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
+- $ qemu-8.0.1 -M pc-7.2 -> qemu-7.2 -M pc-7.2
+
+But now we find that qemu-8.0 can migrate neither to qemu-7.2 nor to
+qemu-8.0.1.
+
+- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
+- $ qemu-8.0.1 -M pc-7.2 -> qemu-8.0 -M pc-7.2
+
+So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
+anything except to qemu-8.0.
+
+Can we do better?
+
+Yes. If we know that we are going to do this migration:
+
+- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2
+
+We can launch the appropriate devices with::
+
+ --device...,x-pcie-err-unc-mask=on
+
+And now we can receive a migration from 8.0. And from now on, we can
+do that migration to new machine types if we remember to enable that
+property for pc-7.2. Notice that we need to remember, it is not
+enough to know that the source of the migration is qemu-8.0. Think of
+this example:
+
+$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2
+
+In the second migration, the source is not qemu-8.0, but we still have
+that "problem" and have that property enabled. Notice that we need to
+continue having this mark/property until we have this machine
+rebooted. But it is not a normal reboot (that doesn't reload QEMU); we
+need the machine to poweroff/poweron on a fixed QEMU. And from then
+on we can use the proper real machine.
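+
+The whole pc-7.2 matrix can be reproduced with a toy model in which a
+migration works only when both sides agree on whether
+PCI_ERR_UNCOR_MASK is configured. This is a sketch under that single
+assumption; the tables and helper names are hypothetical, and real
+migrations can of course fail for unrelated reasons.

```python
# Toy model: is PCI_ERR_UNCOR_MASK configured by default for -M pc-7.2?
DEFAULT_MASK_ON_PC_7_2 = {
    "7.2": False,    # register not implemented yet
    "8.0": True,     # implemented, but no compat entry for pc-7.2
    "8.0.1": False,  # hw_compat_7_2 sets x-pcie-err-unc-mask=off
}

# Releases in this model where x-pcie-err-unc-mask can be set by hand.
HAS_PROPERTY = {"8.0.1"}


def mask_enabled(qemu, override=None):
    """Return whether the register is configured, honouring an override."""
    if override is None:
        return DEFAULT_MASK_ON_PC_7_2[qemu]
    if qemu not in HAS_PROPERTY:
        raise ValueError(f"qemu-{qemu} has no x-pcie-err-unc-mask property")
    return override


def can_migrate(src, dst, src_override=None, dst_override=None):
    """Migration of pc-7.2 works when both sides configure PCI alike."""
    return mask_enabled(src, src_override) == mask_enabled(dst, dst_override)


if __name__ == "__main__":
    versions = ["7.2", "8.0", "8.0.1"]
    for src in versions:
        for dst in versions:
            result = "works" if can_migrate(src, dst) else "fails"
            print(f"qemu-{src} -M pc-7.2 -> qemu-{dst} -M pc-7.2: {result}")
    # Receiving from qemu-8.0 with the property forced on at the destination.
    print(can_migrate("8.0", "8.0.1", dst_override=True))
```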
diff --git a/docs/devel/migration/dirty-limit.rst b/docs/devel/migration/dirty-limit.rst
new file mode 100644
index 0000000000..8f32329d5f
--- /dev/null
+++ b/docs/devel/migration/dirty-limit.rst
@@ -0,0 +1,71 @@
+Dirty limit
+===========
+
+The dirty limit, short for dirty page rate upper limit, is a new capability
+introduced in the 8.1 QEMU release that uses a new algorithm based on the KVM
+dirty ring to throttle down the guest during live migration.
+
+The algorithm framework is as follows:
+
+::
+
+ ------------------------------------------------------------------------------
+ main --------------> throttle thread ------------> PREPARE(1) <--------
+ thread \ | |
+ \ | |
+ \ V |
+ -\ CALCULATE(2) |
+ \ | |
+ \ | |
+ \ V |
+ \ SET PENALTY(3) -----
+ -\ |
+ \ |
+ \ V
+ -> virtual CPU thread -------> ACCEPT PENALTY(4)
+ ------------------------------------------------------------------------------
+
+When the qmp command qmp_set_vcpu_dirty_limit is called for the first time,
+the QEMU main thread starts the throttle thread. The throttle thread, once
+launched, executes the loop, which consists of three steps:
+
+ - PREPARE (1)
+
+ The entire work of PREPARE (1) is preparation for the second stage,
+ CALCULATE(2), as the name implies. It involves preparing the dirty
+ page rate value and the corresponding upper limit of the VM:
+ The dirty page rate is calculated via the KVM dirty ring mechanism,
+ which tells QEMU how many dirty pages a virtual CPU has had since the
+ last KVM_EXIT_DIRTY_RING_FULL exception; the dirty page rate upper
+ limit is specified by the caller, so it is fetched directly.
+
+ - CALCULATE (2)
+
+ Calculate a suitable sleep period for each virtual CPU, which will be
+ used to determine the penalty for the target virtual CPU. The
+ computation must be done carefully in order to reduce the dirty page
+ rate progressively down to the upper limit without oscillation. To
+ achieve this, two strategies are provided: the first is to add or
+ subtract sleep time based on the ratio of the current dirty page rate
+ to the limit, which is used when the current dirty page rate is far
+ from the limit; the second is to add or subtract a fixed time when
+ the current dirty page rate is close to the limit.
+
+ - SET PENALTY (3)
+
+ Set the sleep time for each virtual CPU that should be penalized based
+ on the results of the calculation supplied by step CALCULATE (2).
+
+After completing the three above stages, the throttle thread loops back
+to step PREPARE (1) until the dirty limit is reached.
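+
+The two strategies of step CALCULATE (2) can be illustrated with a toy
+model. This is not QEMU's implementation: the ``adjust_sleep_us()``
+helper, the ``far_ratio`` threshold, and the ``period_us`` and
+``fixed_step_us`` constants are all assumptions chosen only to show
+the shape of the adjustment.

```python
def adjust_sleep_us(sleep_us, rate, limit,
                    far_ratio=0.5, period_us=1000, fixed_step_us=100):
    """One toy CALCULATE step for a single virtual CPU.

    All parameters other than the current sleep time, the dirty page
    rate and its limit are hypothetical tuning knobs.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if rate == limit:
        return sleep_us
    # Relative distance between the current rate and the limit (0..1).
    gap = abs(rate - limit) / max(rate, limit)
    if gap > far_ratio:
        # Far from the limit: step proportional to the distance.
        step = period_us * gap
    else:
        # Close to the limit: small fixed step to avoid oscillation.
        step = fixed_step_us
    if rate > limit:
        sleep_us += step   # dirtying too fast: sleep longer
    else:
        sleep_us -= step   # below the limit: relax the penalty
    return max(0, sleep_us)


if __name__ == "__main__":
    sleep = 0
    for rate in (400, 110, 90, 0):
        sleep = adjust_sleep_us(sleep, rate, limit=100)
        print(f"rate={rate} limit=100 -> sleep={sleep}us")
```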
+
+On the other hand, each virtual CPU thread reads the sleep duration and
+sleeps in the path of the KVM_EXIT_DIRTY_RING_FULL exception handler, that
+is ACCEPT PENALTY (4). Virtual CPUs tied to write-heavy processes will
+naturally take this exit path and get penalized, whereas virtual CPUs
+involved with read processes will not.
+
+In summary, thanks to the KVM dirty ring technology, the dirty limit
+algorithm will restrict virtual CPUs as needed to keep their dirty page
+rate inside the limit. This leads to more steady reading performance during
+live migration and can aid in improving large guest responsiveness.
diff --git a/docs/devel/migration/features.rst b/docs/devel/migration/features.rst
new file mode 100644
index 0000000000..d5ca7b86d5
--- /dev/null
+++ b/docs/devel/migration/features.rst
@@ -0,0 +1,14 @@
+Migration features
+==================
+
+Migration has plenty of features to support different use cases.
+
+.. toctree::
+ :maxdepth: 2
+
+ postcopy
+ dirty-limit
+ vfio
+ virtio
+ mapped-ram
+ CPR
diff --git a/docs/devel/migration/index.rst b/docs/devel/migration/index.rst
new file mode 100644
index 0000000000..2aa294d631
--- /dev/null
+++ b/docs/devel/migration/index.rst
@@ -0,0 +1,13 @@
+Migration
+=========
+
+This is the main entry for QEMU migration documentation. It explains how
+QEMU live migration works.
+
+.. toctree::
+ :maxdepth: 2
+
+ main
+ features
+ compatibility
+ best-practices
diff --git a/docs/devel/migration.rst b/docs/devel/migration/main.rst
index 2401253482..54385a23e5 100644
--- a/docs/devel/migration.rst
+++ b/docs/devel/migration/main.rst
@@ -1,6 +1,6 @@
-=========
-Migration
-=========
+===================
+Migration framework
+===================
QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations. Saving the state just does
@@ -28,6 +28,8 @@ the guest to be stopped. Typically the time that the guest is
unresponsive during live migration is the low hundred of milliseconds
(notice that this depends on a lot of things).
+.. contents::
+
Transports
==========
@@ -39,6 +41,11 @@ over any transport.
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
passed to QEMU. QEMU doesn't care how this file descriptor is opened.
+- file migration: do the migration using a file that is passed to QEMU
+ by path. A file offset option is supported to allow a management
+ application to add its own metadata to the start of the file without
+ QEMU interference. Note that QEMU does not flush cached file
+ data/metadata at the end of migration.
In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
@@ -50,27 +57,6 @@ All these migration protocols use the same infrastructure to
save/restore state devices. This infrastructure is shared with the
savevm/loadvm functionality.
-Debugging
-=========
-
-The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``.
-
-Example usage:
-
-.. code-block:: shell
-
- $ qemu-system-x86_64 -display none -monitor stdio
- (qemu) migrate "exec:cat > mig"
- (qemu) q
- $ ./scripts/analyze-migration.py -f mig
- {
- "ram (3)": {
- "section sizes": {
- "pc.ram": "0x0000000008000000",
- ...
-
-See also ``analyze-migration.py -h`` help for more options.
-
Common infrastructure
=====================
@@ -156,7 +142,7 @@ An example (from hw/input/pckbd.c)
.name = "pckbd",
.version_id = 3,
.minimum_version_id = 3,
- .fields = (VMStateField[]) {
+ .fields = (const VMStateField[]) {
VMSTATE_UINT8(write_cmd, KBDState),
VMSTATE_UINT8(status, KBDState),
VMSTATE_UINT8(mode, KBDState),
@@ -165,13 +151,17 @@ An example (from hw/input/pckbd.c)
}
};
-We are declaring the state with name "pckbd".
-The ``version_id`` is 3, and the fields are 4 uint8_t in a KBDState structure.
-We registered this with:
+We are declaring the state with name "pckbd". The ``version_id`` is
+3, and there are 4 uint8_t fields in the KBDState structure. We
+registered this ``VMStateDescription`` with one of the following
+functions. The first one will generate a device ``instance_id``
+different for each registration. Use the second one if you already
+have an id that is different for each instance of the device:
.. code:: c
- vmstate_register(NULL, 0, &vmstate_kbd, s);
+ vmstate_register_any(NULL, &vmstate_kbd, s);
+ vmstate_register(NULL, instance_id, &vmstate_kbd, s);
For devices that are ``qdev`` based, we can register the device in the class
init function:
@@ -288,7 +278,7 @@ Example:
.pre_save = ide_drive_pio_pre_save,
.post_load = ide_drive_pio_post_load,
.needed = ide_drive_pio_state_needed,
- .fields = (VMStateField[]) {
+ .fields = (const VMStateField[]) {
VMSTATE_INT32(req_nb_sectors, IDEState),
VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
vmstate_info_uint8, uint8_t),
@@ -306,11 +296,11 @@ Example:
.version_id = 3,
.minimum_version_id = 0,
.post_load = ide_drive_post_load,
- .fields = (VMStateField[]) {
+ .fields = (const VMStateField[]) {
.... several fields ....
VMSTATE_END_OF_LIST()
},
- .subsections = (const VMStateDescription*[]) {
+ .subsections = (const VMStateDescription * const []) {
&vmstate_ide_drive_pio_state,
NULL
}
@@ -389,19 +379,13 @@ Each version is associated with a series of fields saved. The ``save_state`` al
the state as the newer version. But ``load_state`` sometimes is able to
load state from an older version.
-You can see that there are several version fields:
+You can see that there are two version fields:
- ``version_id``: the maximum version_id supported by VMState for that device.
- ``minimum_version_id``: the minimum version_id that VMState is able to understand
for that device.
-- ``minimum_version_id_old``: For devices that were not able to port to vmstate, we can
- assign a function that knows how to read this old state. This field is
- ignored if there is no ``load_state_old`` handler.
-VMState is able to read versions from minimum_version_id to
-version_id. And the function ``load_state_old()`` (if present) is able to
-load state from minimum_version_id_old to minimum_version_id. This
-function is deprecated and will be removed when no more users are left.
+VMState is able to read versions from minimum_version_id to version_id.
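The accepted range can be pictured with a minimal sketch. The struct and helper below are hypothetical stand-ins written for illustration; they are not QEMU's ``VMStateDescription`` or its loader, and only model the "minimum_version_id to version_id" rule stated above.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two version fields discussed above;
 * this is not QEMU's VMStateDescription. */
struct vmsd_versions {
    int version_id;          /* newest version this code saves and loads */
    int minimum_version_id;  /* oldest incoming version still accepted */
};

/* Return true if a section saved with incoming_version falls inside
 * the inclusive range [minimum_version_id, version_id]. */
static bool vmsd_version_acceptable(const struct vmsd_versions *v,
                                    int incoming_version)
{
    return incoming_version >= v->minimum_version_id &&
           incoming_version <= v->version_id;
}
```

With ``version_id = 3`` and ``minimum_version_id = 2``, incoming versions 2 and 3 would be accepted, while 1 and 4 would be rejected.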
There are *_V* forms of many ``VMSTATE_`` macros to load fields for version dependent fields,
e.g.
@@ -452,10 +436,10 @@ data doesn't match the stored device data well; it allows an
intermediate temporary structure to be populated with migration
data and then transferred to the main structure.
-If you use memory API functions that update memory layout outside
+If you use memory or portio_list API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a ``post_load`` callback.
-Examples of such memory API functions are:
+Examples of such API functions are:
- memory_region_add_subregion()
- memory_region_del_subregion()
@@ -464,6 +448,8 @@ Examples of such memory API functions are:
- memory_region_set_enabled()
- memory_region_set_address()
- memory_region_set_alias_offset()
+ - portio_list_set_address()
+ - portio_list_set_enabled()
Iterative device migration
--------------------------
@@ -488,15 +474,17 @@ An iterative device must provide:
- A ``load_setup`` function that initialises the data structures on the
destination.
- - A ``save_live_pending`` function that is called repeatedly and must
- indicate how much more data the iterative data must save. The core
- migration code will use this to determine when to pause the CPUs
- and complete the migration.
+ - A ``state_pending_exact`` function that indicates how much more
+ data we must save. The core migration code will use this to
+ determine when to pause the CPUs and complete the migration.
+
+ - A ``state_pending_estimate`` function that indicates how much more
+ data we must save. When the estimated amount is smaller than the
+ threshold, we call ``state_pending_exact``.
- - A ``save_live_iterate`` function (called after ``save_live_pending``
- when there is significant data still to be sent). It should send
- a chunk of data until the point that stream bandwidth limits tell it
- to stop. Each call generates one section.
+ - A ``save_live_iterate`` function that sends a chunk of data until
+ the point that stream bandwidth limits tell it to stop. Each call
+ generates one section.
- A ``save_live_complete_precopy`` function that must transmit the
last section for the device containing any remaining data.
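The interplay between the estimate and exact callbacks can be sketched as follows. This is an illustration under stated assumptions, not the core migration code: the callback type, the ``demo_*`` helpers and the comparison against a single ``threshold`` are all hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback type: returns the number of bytes still to send. */
typedef uint64_t (*pending_fn)(void *opaque);

/* Decide whether iteration can stop and the migration can complete.
 * The cheap estimate is consulted first; the exact figure is only
 * requested once the estimate drops below the threshold. */
static bool should_complete(pending_fn estimate, pending_fn exact,
                            void *opaque, uint64_t threshold)
{
    if (estimate(opaque) >= threshold) {
        return false;               /* still a lot to send: keep iterating */
    }
    return exact(opaque) < threshold;
}

/* Demo callbacks (assumptions for illustration): opaque holds two counters. */
struct demo_pending {
    uint64_t estimate;
    uint64_t exact;
};

static uint64_t demo_estimate(void *opaque)
{
    return ((struct demo_pending *)opaque)->estimate;
}

static uint64_t demo_exact(void *opaque)
{
    return ((struct demo_pending *)opaque)->exact;
}
```

The point of the split is cost: an estimate can be computed without an expensive synchronisation step, so the exact count is only paid for when completion looks plausible.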
@@ -592,292 +580,3 @@ path.
Return path - opened by main thread, written by main thread AND postcopy
thread (protected by rp_mutex)
-Postcopy
-========
-
-'Postcopy' migration is a way to deal with migrations that refuse to converge
-(or take too long to converge) its plus side is that there is an upper bound on
-the amount of migration traffic and time it takes, the down side is that during
-the postcopy phase, a failure of *either* side or the network connection causes
-the guest to be lost.
-
-In postcopy the destination CPUs are started before all the memory has been
-transferred, and accesses to pages that are yet to be transferred cause
-a fault that's translated by QEMU into a request to the source QEMU.
-
-Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
-doesn't finish in a given time the switch is made to postcopy.
-
-Enabling postcopy
------------------
-
-To enable postcopy, issue this command on the monitor (both source and
-destination) prior to the start of migration:
-
-``migrate_set_capability postcopy-ram on``
-
-The normal commands are then used to start a migration, which is still
-started in precopy mode. Issuing:
-
-``migrate_start_postcopy``
-
-will now cause the transition from precopy to postcopy.
-It can be issued immediately after migration is started or any
-time later on. Issuing it after the end of a migration is harmless.
-
-Blocktime is a postcopy live migration metric, intended to show how
-long the vCPU was in state of interruptible sleep due to pagefault.
-That metric is calculated both for all vCPUs as overlapped value, and
-separately for each vCPU. These values are calculated on destination
-side. To enable postcopy blocktime calculation, enter following
-command on destination monitor:
-
-``migrate_set_capability postcopy-blocktime on``
-
-Postcopy blocktime can be retrieved by query-migrate qmp command.
-postcopy-blocktime value of qmp command will show overlapped blocking
-time for all vCPU, postcopy-vcpu-blocktime will show list of blocking
-time per vCPU.
-
-.. note::
- During the postcopy phase, the bandwidth limits set using
- ``migrate_set_parameter`` is ignored (to avoid delaying requested pages that
- the destination is waiting for).
-
-Postcopy device transfer
-------------------------
-
-Loading of device data may cause the device emulation to access guest RAM
-that may trigger faults that have to be resolved by the source, as such
-the migration stream has to be able to respond with page data *during* the
-device load, and hence the device data has to be read from the stream completely
-before the device load begins to free the stream up. This is achieved by
-'packaging' the device data into a blob that's read in one go.
-
-Source behaviour
-----------------
-
-Until postcopy is entered the migration stream is identical to normal
-precopy, except for the addition of a 'postcopy advise' command at
-the beginning, to tell the destination that postcopy might happen.
-When postcopy starts the source sends the page discard data and then
-forms the 'package' containing:
-
- - Command: 'postcopy listen'
- - The device state
-
- A series of sections, identical to the precopy streams device state stream
- containing everything except postcopiable devices (i.e. RAM)
- - Command: 'postcopy run'
-
-The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
-contents are formatted in the same way as the main migration stream.
-
-During postcopy the source scans the list of dirty pages and sends them
-to the destination without being requested (in much the same way as precopy),
-however when a page request is received from the destination, the dirty page
-scanning restarts from the requested location. This causes requested pages
-to be sent quickly, and also causes pages directly after the requested page
-to be sent quickly in the hope that those pages are likely to be used
-by the destination soon.
-
-Destination behaviour
----------------------
-
-Initially the destination looks the same as precopy, with a single thread
-reading the migration stream; the 'postcopy advise' and 'discard' commands
-are processed to change the way RAM is managed, but don't affect the stream
-processing.
-
-::
-
- ------------------------------------------------------------------------------
- 1 2 3 4 5 6 7
- main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
- thread | |
- | (page request)
- | \___
- v \
- listen thread: --- page -- page -- page -- page -- page --
-
- a b c
- ------------------------------------------------------------------------------
-
-- On receipt of ``CMD_PACKAGED`` (1)
-
- All the data associated with the package - the ( ... ) section in the diagram -
- is read into memory, and the main thread recurses into qemu_loadvm_state_main
- to process the contents of the package (2) which contains commands (3,6) and
- devices (4...)
-
-- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)
-
- a new thread (a) is started that takes over servicing the migration stream,
- while the main thread carries on loading the package. It loads normal
- background page data (b) but if during a device load a fault happens (5)
- the returned page (c) is loaded by the listen thread allowing the main
- threads device load to carry on.
-
-- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)
-
- letting the destination CPUs start running. At the end of the
- ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
- is no longer used by migration, while the listen thread carries on servicing
- page data until the end of migration.
-
-Postcopy states
----------------
-
-Postcopy moves through a series of states (see postcopy_state) from
-ADVISE->DISCARD->LISTEN->RUNNING->END
-
- - Advise
-
- Set at the start of migration if postcopy is enabled, even
- if it hasn't had the start command; here the destination
- checks that its OS has the support needed for postcopy, and performs
- setup to ensure the RAM mappings are suitable for later postcopy.
- The destination will fail early in migration at this point if the
- required OS support is not present.
- (Triggered by reception of POSTCOPY_ADVISE command)
-
- - Discard
-
- Entered on receipt of the first 'discard' command; prior to
- the first Discard being performed, hugepages are switched off
- (using madvise) to ensure that no new huge pages are created
- during the postcopy phase, and to cause any huge pages that
- have discards on them to be broken.
-
- - Listen
-
- The first command in the package, POSTCOPY_LISTEN, switches
- the destination state to Listen, and starts a new thread
- (the 'listen thread') which takes over the job of receiving
- pages off the migration stream, while the main thread carries
- on processing the blob. With this thread able to process page
- reception, the destination now 'sensitises' the RAM to detect
- any access to missing pages (on Linux using the 'userfault'
- system).
-
- - Running
-
- POSTCOPY_RUN causes the destination to synchronise all
- state and start the CPUs and IO devices running. The main
- thread now finishes processing the migration package and
- now carries on as it would for normal precopy migration
- (although it can't do the cleanup it would do as it
- finishes a normal migration).
-
- - End
-
- The listen thread can now quit, and perform the cleanup of migration
- state, the migration is now complete.
-
-Source side page maps
----------------------
-
-The source side keeps two bitmaps during postcopy; 'the migration bitmap'
-and 'unsent map'. The 'migration bitmap' is basically the same as in
-the precopy case, and holds a bit to indicate that page is 'dirty' -
-i.e. needs sending. During the precopy phase this is updated as the CPU
-dirties pages, however during postcopy the CPUs are stopped and nothing
-should dirty anything any more.
-
-The 'unsent map' is used for the transition to postcopy. It is a bitmap that
-has a bit cleared whenever a page is sent to the destination, however during
-the transition to postcopy mode it is combined with the migration bitmap
-to form a set of pages that:
-
- a) Have been sent but then redirtied (which must be discarded)
- b) Have not yet been sent - which also must be discarded to cause any
- transparent huge pages built during precopy to be broken.
-
-Note that the contents of the unsentmap are sacrificed during the calculation
-of the discard set and thus aren't valid once in postcopy. The dirtymap
-is still valid and is used to ensure that no page is sent more than once. Any
-request for a page that has already been sent is ignored. Duplicate requests
-such as this can happen as a page is sent at about the same time the
-destination accesses it.
-
-Postcopy with hugepages
------------------------
-
-Postcopy now works with hugetlbfs backed memory:
-
- a) The linux kernel on the destination must support userfault on hugepages.
- b) The huge-page configuration on the source and destination VMs must be
- identical; i.e. RAMBlocks on both sides must use the same page size.
- c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
- RAM if it doesn't have enough hugepages, triggering (b) to fail.
- Using ``-mem-prealloc`` enforces the allocation using hugepages.
- d) Care should be taken with the size of hugepage used; postcopy with 2MB
- hugepages works well, however 1GB hugepages are likely to be problematic
- since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
- and until the full page is transferred the destination thread is blocked.
-
-Postcopy with shared memory
----------------------------
-
-Postcopy migration with shared memory needs explicit support from the other
-processes that share memory and from QEMU. There are restrictions on the type of
-memory that userfault can support shared.
-
-The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
-(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
-for hugetlbfs which may be a problem in some configurations).
-
-The vhost-user code in QEMU supports clients that have Postcopy support,
-and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
-to support postcopy.
-
-The client needs to open a userfaultfd and register the areas
-of memory that it maps with userfault. The client must then pass the
-userfaultfd back to QEMU together with a mapping table that allows
-fault addresses in the clients address space to be converted back to
-RAMBlock/offsets. The client's userfaultfd is added to the postcopy
-fault-thread and page requests are made on behalf of the client by QEMU.
-QEMU performs 'wake' operations on the client's userfaultfd to allow it
-to continue after a page has arrived.
-
-.. note::
- There are two future improvements that would be nice:
- a) Some way to make QEMU ignorant of the addresses in the clients
- address space
- b) Avoiding the need for QEMU to perform ufd-wake calls after the
- pages have arrived
-
-Retro-fitting postcopy to existing clients is possible:
- a) A mechanism is needed for the registration with userfault as above,
- and the registration needs to be coordinated with the phases of
- postcopy. In vhost-user extra messages are added to the existing
- control channel.
- b) Any thread that can block due to guest memory accesses must be
- identified and the implication understood; for example if the
- guest memory access is made while holding a lock then all other
- threads waiting for that lock will also be blocked.
-
-Firmware
-========
-
-Migration migrates the copies of RAM and ROM, and thus when running
-on the destination it includes the firmware from the source. Even after
-resetting a VM, the old firmware is used. Only once QEMU has been restarted
-is the new firmware in use.
-
-- Changes in firmware size can cause changes in the required RAMBlock size
- to hold the firmware and thus migration can fail. In practice it's best
- to pad firmware images to convenient powers of 2 with plenty of space
- for growth.
-
-- Care should be taken with device emulation code so that newer
- emulation code can work with older firmware to allow forward migration.
-
-- Care should be taken with newer firmware so that backward migration
- to older systems with older device emulation code will work.
-
-In some cases it may be best to tie specific firmware versions to specific
-versioned machine types to cut down on the combinations that will need
-support. This is also useful when newer versions of firmware outgrow
-the padding.
-
diff --git a/docs/devel/migration/mapped-ram.rst b/docs/devel/migration/mapped-ram.rst
new file mode 100644
index 0000000000..fa4cefd9fc
--- /dev/null
+++ b/docs/devel/migration/mapped-ram.rst
@@ -0,0 +1,138 @@
+Mapped-ram
+==========
+
+Mapped-ram is a new stream format for the RAM section designed to
+supplement the existing ``file:`` migration and make it compatible
+with ``multifd``. This enables parallel migration of a guest's RAM to
+a file.
+
+The core of the feature is to ensure that RAM pages are mapped
+directly to offsets in the resulting migration file. This enables the
+``multifd`` threads to write exclusively to those offsets even if the
+guest is constantly dirtying pages (i.e. live migration). Another
+benefit is that the resulting file will have a bounded size, since
+pages which are dirtied multiple times will always go to a fixed
+location in the file, rather than constantly being added to a
+sequential stream. Having the pages at fixed offsets also allows the
+usage of O_DIRECT for save/restore of the migration stream as the
+pages are ensured to be written respecting O_DIRECT alignment
+restrictions (direct-io support not yet implemented).
+
+Usage
+-----
+
+On both source and destination, enable the ``multifd`` and
+``mapped-ram`` capabilities:
+
+ ``migrate_set_capability multifd on``
+
+ ``migrate_set_capability mapped-ram on``
+
+Use a ``file:`` URL for migration:
+
+ ``migrate file:/path/to/migration/file``
+
+Mapped-ram migration is best done non-live, i.e. by stopping the VM on
+the source side before migrating.
+
+Use-cases
+---------
+
+The mapped-ram feature was designed for use cases where the migration
+stream will be directed to a file in the filesystem and not
+immediately restored on the destination VM [#]_. These could be
+thought of as snapshots. We can further categorize them into live and
+non-live.
+
+- Non-live snapshot
+
+If the use case requires a VM to be stopped before taking a snapshot,
+that's the ideal scenario for mapped-ram migration. Not having to
+track dirty pages, the migration will write the RAM pages to the disk
+as fast as it can.
+
+Note: if a snapshot is taken of a running VM, but the VM will be
+stopped after the snapshot by the admin, then consider stopping it
+right before the snapshot to benefit from the performance gains
+mentioned above.
+
+- Live snapshot
+
+If the use case requires that the VM keeps running during and after
+the snapshot operation, then mapped-ram migration can still be used,
+but will be less performant. Other strategies such as
+background-snapshot should be evaluated as well. One benefit of
+mapped-ram in this scenario is portability since background-snapshot
+depends on async dirty tracking (KVM_GET_DIRTY_LOG) which is not
+supported outside of Linux.
+
+.. [#] While this same effect could be obtained with the usage of
+ snapshots or the ``file:`` migration alone, mapped-ram provides
+ a performance increase for VMs with larger RAM sizes (10s to
+ 100s of GiBs), especially if the VM has been stopped beforehand.
+
+RAM section format
+------------------
+
+Rather than a sequential stream of pages that follows all of the
+RAMBlock headers, the dirty pages for each RAMBlock directly follow
+that RAMBlock's header. This ensures that each RAM page has a fixed
+offset in the resulting migration file.
+
+A bitmap is introduced to track which pages have been written in the
+migration file. Pages are written at a fixed location for every
+ramblock. Zero pages are not written, as they will be zero on the
+destination as well.
+
+::
+
+ Without mapped-ram: With mapped-ram:
+
+ --------------------- --------------------------------
+ | ramblock 1 header | | ramblock 1 header |
+ --------------------- --------------------------------
+ | ramblock 2 header | | ramblock 1 mapped-ram header |
+ --------------------- --------------------------------
+ | ... | | padding to next 1MB boundary |
+ --------------------- | ... |
+ | ramblock n header | --------------------------------
+ --------------------- | ramblock 1 pages |
+ | RAM_SAVE_FLAG_EOS | | ... |
+ --------------------- --------------------------------
+ | stream of pages | | ramblock 2 header |
+ | (iter 1) | --------------------------------
+ | ... | | ramblock 2 mapped-ram header |
+ --------------------- --------------------------------
+ | RAM_SAVE_FLAG_EOS | | padding to next 1MB boundary |
+ --------------------- | ... |
+ | stream of pages | --------------------------------
+ | (iter 2) | | ramblock 2 pages |
+ | ... | | ... |
+ --------------------- --------------------------------
+ | ... | | ... |
+ --------------------- --------------------------------
+ | RAM_SAVE_FLAG_EOS |
+ --------------------------------
+ | ... |
+ --------------------------------
+
+where:
+ - ramblock header: the generic information for a ramblock, such as
+ idstr, used_len, etc.
+
+ - ramblock mapped-ram header: the information added by this feature:
+ bitmap of pages written, bitmap size and offset of pages in the
+ migration file.
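The fixed-offset idea can be illustrated with a small sketch. All names here are hypothetical, and the layout rule (page offset = start of the pages region plus page index times page size) is an assumption derived from the description above, not a copy of QEMU's implementation. The 1MB alignment is taken from the diagram.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed padding boundary, as shown in the diagram above. */
#define MAPPED_RAM_ALIGN (1024 * 1024)

/* Round offset up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t offset, uint64_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* File offset of one page of a ramblock, given where that ramblock's
 * pages region starts in the file. */
static uint64_t page_file_offset(uint64_t pages_offset, uint64_t page_index,
                                 uint64_t page_size)
{
    return pages_offset + page_index * page_size;
}

/* Record in the per-ramblock bitmap that a page has been written. */
static void bitmap_mark_written(unsigned char *bitmap, uint64_t page_index)
{
    bitmap[page_index / 8] |= (unsigned char)(1u << (page_index % 8));
}

/* Query the bitmap: pages never marked are treated as zero pages on load. */
static bool bitmap_is_written(const unsigned char *bitmap, uint64_t page_index)
{
    return (bitmap[page_index / 8] >> (page_index % 8)) & 1u;
}
```

Because the offset depends only on the page index, a page dirtied several times overwrites the same file location, which is what keeps the file size bounded.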
+
+Restrictions
+------------
+
+Since pages are written to their relative offsets and out of order
+(due to the memory dirtying patterns), streaming channels such as
+sockets are not supported. A seekable channel such as a file is
+required. This can be verified in the QIOChannel by the presence of
+the QIO_CHANNEL_FEATURE_SEEKABLE.
+
+The improvements brought by this feature apply only to guest physical
+RAM. Other types of memory such as VRAM are migrated as part of device
+states.
diff --git a/docs/devel/migration/postcopy.rst b/docs/devel/migration/postcopy.rst
new file mode 100644
index 0000000000..6c51e96d79
--- /dev/null
+++ b/docs/devel/migration/postcopy.rst
@@ -0,0 +1,313 @@
+========
+Postcopy
+========
+
+.. contents::
+
+'Postcopy' migration is a way to deal with migrations that refuse to converge
+(or take too long to converge). Its plus side is that there is an upper bound on
+the amount of migration traffic and time it takes; the down side is that during
+the postcopy phase, a failure of *either* side causes the guest to be lost.
+
+In postcopy the destination CPUs are started before all the memory has been
+transferred, and accesses to pages that are yet to be transferred cause
+a fault that's translated by QEMU into a request to the source QEMU.
+
+Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
+doesn't finish in a given time the switch is made to postcopy.
+
+Enabling postcopy
+=================
+
+To enable postcopy, issue this command on the monitor (both source and
+destination) prior to the start of migration:
+
+``migrate_set_capability postcopy-ram on``
+
+The normal commands are then used to start a migration, which is still
+started in precopy mode. Issuing:
+
+``migrate_start_postcopy``
+
+will now cause the transition from precopy to postcopy.
+It can be issued immediately after migration is started or any
+time later on. Issuing it after the end of a migration is harmless.
+
+Blocktime is a postcopy live migration metric, intended to show how
+long a vCPU was in a state of interruptible sleep due to a page fault.
+That metric is calculated both for all vCPUs as an overlapped value, and
+separately for each vCPU. These values are calculated on the destination
+side. To enable postcopy blocktime calculation, enter the following
+command on the destination monitor:
+
+``migrate_set_capability postcopy-blocktime on``
+
+Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
+Its ``postcopy-blocktime`` value shows the overlapped blocking time for
+all vCPUs, and ``postcopy-vcpu-blocktime`` shows a list of blocking
+times per vCPU.
+
+.. note::
+ During the postcopy phase, the bandwidth limits set using
+ ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
+ the destination is waiting for).
+
+Postcopy internals
+==================
+
+State machine
+-------------
+
+Postcopy moves through a series of states (see postcopy_state) from
+ADVISE->DISCARD->LISTEN->RUNNING->END
+
+ - Advise
+
+ Set at the start of migration if postcopy is enabled, even
+ if it hasn't had the start command; here the destination
+ checks that its OS has the support needed for postcopy, and performs
+ setup to ensure the RAM mappings are suitable for later postcopy.
+ The destination will fail early in migration at this point if the
+ required OS support is not present.
+ (Triggered by reception of POSTCOPY_ADVISE command)
+
+ - Discard
+
+ Entered on receipt of the first 'discard' command; prior to
+ the first Discard being performed, hugepages are switched off
+ (using madvise) to ensure that no new huge pages are created
+ during the postcopy phase, and to cause any huge pages that
+ have discards on them to be broken.
+
+ - Listen
+
+ The first command in the package, POSTCOPY_LISTEN, switches
+ the destination state to Listen, and starts a new thread
+ (the 'listen thread') which takes over the job of receiving
+ pages off the migration stream, while the main thread carries
+ on processing the blob. With this thread able to process page
+ reception, the destination now 'sensitises' the RAM to detect
+ any access to missing pages (on Linux using the 'userfault'
+ system).
+
+ - Running
+
+ POSTCOPY_RUN causes the destination to synchronise all
+ state and start the CPUs and IO devices running. The main
+ thread now finishes processing the migration package and
+ now carries on as it would for normal precopy migration
+ (although it can't do the cleanup it would do as it
+ finishes a normal migration).
+
+ - Paused
+
+ Postcopy can enter a paused state (normally on both sides when it
+ happens), in which all threads are temporarily halted, mostly due to
+ network errors. In the paused state, QEMU on both sides preserves
+ its migration data so that the VM is not corrupted. To continue the
+ migration, the admin needs to fix the migration channel using the
+ QMP command 'migrate-recover' on the destination node, then resume
+ the migration using the QMP command 'migrate' again on the source
+ node, with the resume=true flag set.
+
+ - End
+
+ The listen thread can now quit and perform the cleanup of migration
+ state; the migration is now complete.
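The state sequence above can be sketched as a small transition check. The enum names are hypothetical and only mirror the list in this section; they are not QEMU's actual ``postcopy_state`` values. The Paused-to-Running edge is an assumption that stands in for a successful recovery.

```c
#include <stdbool.h>

/* Hypothetical names modelled on the states listed above. */
enum pc_state {
    PC_ADVISE,
    PC_DISCARD,
    PC_LISTEN,
    PC_RUNNING,
    PC_PAUSED,
    PC_END
};

/* Return true if moving from one state to another follows the
 * ADVISE->DISCARD->LISTEN->RUNNING->END sequence, with PAUSED as a
 * detour that can only be entered from, and left towards, RUNNING. */
static bool pc_transition_ok(enum pc_state from, enum pc_state to)
{
    switch (from) {
    case PC_ADVISE:
        return to == PC_DISCARD;
    case PC_DISCARD:
        return to == PC_LISTEN;
    case PC_LISTEN:
        return to == PC_RUNNING;
    case PC_RUNNING:
        return to == PC_PAUSED || to == PC_END;
    case PC_PAUSED:
        return to == PC_RUNNING;   /* assumed: resume after recovery */
    case PC_END:
        return false;              /* terminal state */
    }
    return false;
}
```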
+
+Device transfer
+---------------
+
+Loading of device data may cause the device emulation to access guest RAM
+that may trigger faults that have to be resolved by the source, as such
+the migration stream has to be able to respond with page data *during* the
+device load, and hence the device data has to be read from the stream completely
+before the device load begins to free the stream up. This is achieved by
+'packaging' the device data into a blob that's read in one go.
+
+Source behaviour
+----------------
+
+Until postcopy is entered the migration stream is identical to normal
+precopy, except for the addition of a 'postcopy advise' command at
+the beginning, to tell the destination that postcopy might happen.
+When postcopy starts the source sends the page discard data and then
+forms the 'package' containing:
+
+ - Command: 'postcopy listen'
+ - The device state
+
+ A series of sections, identical to the precopy stream's device state stream
+ containing everything except postcopiable devices (i.e. RAM)
+ - Command: 'postcopy run'
+
+The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
+contents are formatted in the same way as the main migration stream.
+
+During postcopy the source scans the list of dirty pages and sends them
+to the destination without being requested (in much the same way as precopy),
+however when a page request is received from the destination, the dirty page
+scanning restarts from the requested location. This causes requested pages
+to be sent quickly, and also causes pages directly after the requested page
+to be sent quickly in the hope that those pages are likely to be used
+by the destination soon.
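The "restart the scan at the requested page" behaviour can be sketched as below. This is a simplified model with hypothetical names: it uses one ``bool`` per page instead of a real bitmap, and the wrap-around at the end of the page range is an assumption made to keep the example short.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of the source-side scan state. */
struct dirty_scan {
    bool *dirty;     /* one flag per page; true means it still needs sending */
    size_t npages;
    size_t cursor;   /* where the background scan resumes */
};

/* A page request from the destination moves the scan position, so the
 * requested page and its neighbours are sent next. */
static void scan_handle_request(struct dirty_scan *s, size_t page)
{
    if (page < s->npages) {
        s->cursor = page;
    }
}

/* Return the next dirty page at or after the cursor (wrapping around),
 * clearing its flag so it is never sent twice. Returns npages when
 * nothing is left to send. */
static size_t scan_next(struct dirty_scan *s)
{
    for (size_t i = 0; i < s->npages; i++) {
        size_t page = (s->cursor + i) % s->npages;
        if (s->dirty[page]) {
            s->dirty[page] = false;
            s->cursor = (page + 1) % s->npages;
            return page;
        }
    }
    return s->npages;
}
```

A request for a page whose flag is already clear simply moves the cursor; the page itself is not sent again.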
+
+Destination behaviour
+---------------------
+
+Initially the destination looks the same as precopy, with a single thread
+reading the migration stream; the 'postcopy advise' and 'discard' commands
+are processed to change the way RAM is managed, but don't affect the stream
+processing.
+
+::
+
+ ------------------------------------------------------------------------------
+ 1 2 3 4 5 6 7
+ main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
+ thread | |
+ | (page request)
+ | \___
+ v \
+ listen thread: --- page -- page -- page -- page -- page --
+
+ a b c
+ ------------------------------------------------------------------------------
+
+- On receipt of ``CMD_PACKAGED`` (1)
+
+ All the data associated with the package - the ( ... ) section in the diagram -
+ is read into memory, and the main thread recurses into qemu_loadvm_state_main
+ to process the contents of the package (2) which contains commands (3,6) and
+ devices (4...)
+
+- On receipt of 'postcopy listen' (3) (i.e. the 1st command in the package)
+
+ a new thread (a) is started that takes over servicing the migration stream,
+ while the main thread carries on loading the package. It loads normal
+ background page data (b) but if during a device load a fault happens (5)
+ the returned page (c) is loaded by the listen thread allowing the main
+ thread's device load to carry on.
+
+- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)
+
+ letting the destination CPUs start running. At the end of the
+ ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
+ is no longer used by migration, while the listen thread carries on servicing
+ page data until the end of migration.
+
+Source side page bitmap
+-----------------------
+
+The 'migration bitmap' in postcopy is basically the same as in precopy:
+each bit indicates that a page is 'dirty' - i.e. needs sending. During
+the precopy phase this is updated as the CPU dirties pages, however
+during postcopy the source CPUs are stopped and nothing should dirty
+anything any more. Instead, dirty bits are cleared when the relevant
+pages are sent during postcopy.
+
+Postcopy features
+=================
+
+Postcopy recovery
+-----------------
+
+Compared to precopy, postcopy handles errors differently. When an
+error happens (in this case, mostly network errors), QEMU cannot easily
+fail a migration because VM data resides in both source and destination
+QEMU instances. Instead, when an issue happens, QEMU on both sides
+will go into a paused state. A recovery phase is then needed to continue
+the paused postcopy migration.
+
+The recovery phase normally contains a few steps:
+
+ - When a network issue occurs, both QEMU instances go into the PAUSED state
+
+ - When the network is recovered (or a new network is provided), the admin
+ can set up the new channel for migration using the QMP command
+ 'migrate-recover' on the destination node, preparing for a resume.
+
+ - On source host, the admin can continue the interrupted postcopy
+ migration using QMP command 'migrate' with resume=true flag set.
+
+ - After the connection is re-established, QEMU will continue the postcopy
+ migration on both sides.
+
+During a paused postcopy migration, the VM can logically still continue
+running, and it will not be impacted from any page access to pages that
+were already migrated to destination VM before the interruption happens.
+However, if any of the missing pages got accessed on destination VM, the VM
+thread will be halted waiting for the page to be migrated, it means it can
+be halted until the recovery is complete.
+
+The impact of accessing missing pages depends on the configuration
+of the guest. For example, with async page fault
+enabled, the guest can logically proactively schedule out the threads
+accessing missing pages.
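+
+As a rough sketch of the recovery steps above, the QMP exchange could look
+like the following. The ``tcp:192.0.2.10:4444`` address is a placeholder
+chosen for illustration, not something QEMU prescribes.
+
+On the destination, listen on a new channel::
+
+    { "execute": "migrate-recover",
+      "arguments": { "uri": "tcp:192.0.2.10:4444" } }
+
+On the source, resume the paused migration over that channel::
+
+    { "execute": "migrate",
+      "arguments": { "uri": "tcp:192.0.2.10:4444", "resume": true } }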
+
+Postcopy with hugepages
+-----------------------
+
+Postcopy now works with hugetlbfs backed memory:
+
+ a) The Linux kernel on the destination must support userfault on hugepages.
+ b) The huge-page configuration on the source and destination VMs must be
+ identical; i.e. RAMBlocks on both sides must use the same page size.
+ c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
+ RAM if it doesn't have enough hugepages, triggering (b) to fail.
+ Using ``-mem-prealloc`` enforces the allocation using hugepages.
+ d) Care should be taken with the size of hugepage used; postcopy with 2MB
+ hugepages works well, however 1GB hugepages are likely to be problematic
+ since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
+ and until the full page is transferred the destination thread is blocked.
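+
+The ~1 second figure in (d) can be checked with a back-of-the-envelope
+estimate, assuming the link is fully available to migration and ignoring
+protocol overhead::
+
+    1GB page: (2^30 bytes * 8 bits/byte) / (10 * 10^9 bits/s) ~= 0.86 s
+    2MB page: (2^21 bytes * 8 bits/byte) / (10 * 10^9 bits/s) ~= 1.7 ms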
+
+Postcopy with shared memory
+---------------------------
+
+Postcopy migration with shared memory needs explicit support from the other
+processes that share memory and from QEMU. There are restrictions on the types
+of shared memory that userfault can support.
+
+The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
+(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
+for hugetlbfs which may be a problem in some configurations).
+
+The vhost-user code in QEMU supports clients that have Postcopy support,
+and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
+to support postcopy.
+
+The client needs to open a userfaultfd and register the areas
+of memory that it maps with userfault. The client must then pass the
+userfaultfd back to QEMU together with a mapping table that allows
+fault addresses in the client's address space to be converted back to
+RAMBlock/offsets. The client's userfaultfd is added to the postcopy
+fault-thread and page requests are made on behalf of the client by QEMU.
+QEMU performs 'wake' operations on the client's userfaultfd to allow it
+to continue after a page has arrived.
+
+.. note::
+ There are two future improvements that would be nice:
+ a) Some way to make QEMU ignorant of the addresses in the client's
+ address space
+ b) Avoiding the need for QEMU to perform ufd-wake calls after the
+ pages have arrived
+
+Retro-fitting postcopy to existing clients is possible:
+ a) A mechanism is needed for the registration with userfault as above,
+ and the registration needs to be coordinated with the phases of
+ postcopy. In vhost-user extra messages are added to the existing
+ control channel.
+ b) Any thread that can block due to guest memory accesses must be
+ identified and the implication understood; for example if the
+ guest memory access is made while holding a lock then all other
+ threads waiting for that lock will also be blocked.
+
+Postcopy preemption mode
+------------------------
+
+Postcopy preempt is a capability introduced in the QEMU 8.0 release. It
+allows urgent pages (those explicitly requested by the destination QEMU
+after a page fault) to be sent in a separate preempt channel, rather than
+queued in the background migration channel. Anyone who cares about page fault
+latency during a postcopy migration should enable this feature. By default,
+it's not enabled.
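+
+As a minimal sketch, the capability can be turned on with the QMP command
+``migrate-set-capabilities`` before the migration starts. This assumes it is
+set on both source and destination, together with ``postcopy-ram``::
+
+    { "execute": "migrate-set-capabilities",
+      "arguments": { "capabilities": [
+          { "capability": "postcopy-ram", "state": true },
+          { "capability": "postcopy-preempt", "state": true } ] } }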
diff --git a/docs/devel/migration/vfio.rst b/docs/devel/migration/vfio.rst
new file mode 100644
index 0000000000..c49482eab6
--- /dev/null
+++ b/docs/devel/migration/vfio.rst
@@ -0,0 +1,208 @@
+=====================
+VFIO device migration
+=====================
+
+Migration of a virtual machine involves saving the state of each device that
+the guest is using on the source host and restoring this saved state on the
+destination host. This document details how saving and restoring of VFIO
+devices is done in QEMU.
+
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and
+accommodates VFIO devices that have a large amount of data that needs to be
+transferred. The iterative pre-copy phase of migration allows the guest to
+continue running whilst the VFIO device state is transferred to the
+destination, which helps to reduce the total downtime of the VM. VFIO devices
+opt in to pre-copy support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
+VFIO_DEVICE_FEATURE_MIGRATION ioctl.
+
+When pre-copy is supported, it's possible to further reduce downtime by
+enabling the "switchover-ack" migration capability.
+The VFIO migration uAPI defines "initial bytes" as part of its pre-copy data
+stream and recommends that the initial bytes are sent and loaded in the
+destination before stopping the source VM. Enabling this migration capability
+guarantees that and thus can potentially reduce downtime even further.
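+
+A minimal sketch of enabling the capability over QMP, assuming it is set on
+both the source and the destination before migration starts::
+
+    { "execute": "migrate-set-capabilities",
+      "arguments": { "capabilities": [
+          { "capability": "switchover-ack", "state": true } ] } }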
+
+To support migration of multiple devices that might do P2P transactions between
+themselves, VFIO migration uAPI defines an intermediate P2P quiescent state.
+While in the P2P quiescent state, P2P DMA transactions cannot be initiated by
+the device, but the device can respond to incoming ones. Additionally, all
+outstanding P2P transactions are guaranteed to have been completed by the time
+the device enters this state.
+
+All the devices that support P2P migration are first transitioned to the P2P
+quiescent state and only then are they stopped or started. This makes migration
+safe P2P-wise, since starting and stopping the devices is not done atomically
+for all the devices together.
+
+Thus, migration of multiple VFIO devices is allowed only if all the devices
+support P2P migration. Migration of a single VFIO device is allowed regardless
+of P2P migration support.
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment for the ``vfio_device_mig_state`` structure in the header file
+linux-headers/linux/vfio.h.
+
+VFIO implements the device hooks for the iterative approach as follows:
+
+* A ``save_setup`` function that sets up migration on the source.
+
+* A ``load_setup`` function that sets the VFIO device on the destination in
+ _RESUMING state.
+
+* A ``state_pending_estimate`` function that reports an estimate of the
+ remaining pre-copy data that the vendor driver has yet to save for the VFIO
+ device.
+
+* A ``state_pending_exact`` function that reads pending_bytes from the vendor
+ driver, which indicates the amount of data that the vendor driver has yet to
+ save for the VFIO device.
+
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+ active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+ vendor driver during iterative pre-copy phase.
+
+* A ``switchover_ack_needed`` function that checks if the VFIO device uses
+ "switchover-ack" migration capability when this capability is enabled.
+
+* A ``save_state`` function to save the device config space if it is present.
+
+* A ``save_live_complete_precopy`` function that sets the VFIO device in
+ _STOP_COPY state and iteratively copies the data for the VFIO device until
+ the vendor driver indicates that no data remains.
+
+* A ``load_state`` function that loads the config section and the data
+ sections that are generated by the save functions above.
+
+* ``cleanup`` functions for both save and load that perform any migration
+ related cleanup.
+
+
+The VFIO migration code uses a VM state change handler to change the VFIO
+device state when the VM state changes from running to not-running, and
+vice versa.
+
+Similarly, a migration state change handler is used to trigger a transition of
+the VFIO device state when certain changes of the migration state occur. For
+example, the VFIO device state is transitioned back to _RUNNING in case a
+migration failed or was canceled.
+
+System memory dirty pages tracking
+----------------------------------
+
+The ``log_global_start`` and ``log_global_stop`` memory listener callbacks
+inform the VFIO dirty tracking module to start and stop dirty page tracking. A
+``log_sync`` memory listener callback queries the dirty page bitmap from the
+dirty tracking module and marks system memory pages which were DMA-ed by the
+VFIO device as dirty. The dirty page bitmap is queried per container.
+
+Currently there are two ways dirty page tracking can be done:
+(1) Device dirty tracking:
+In this method the device is responsible for logging and reporting its DMAs. This
+method can be used only if the device is capable of tracking its DMAs.
+Discovering device capability, starting and stopping dirty tracking, and
+syncing the dirty bitmaps from the device are done using the DMA logging uAPI.
+More info about the uAPI can be found in the comments of the
+``vfio_device_feature_dma_logging_control`` and
+``vfio_device_feature_dma_logging_report`` structures in the header file
+linux-headers/linux/vfio.h.
+
+(2) VFIO IOMMU module:
+In this method dirty tracking is done by the IOMMU. However, there is currently no
+IOMMU support for dirty page tracking. For this reason, all pages are
+perpetually marked dirty, unless the device driver pins pages through external
+APIs in which case only those pinned pages are perpetually marked dirty.
+
+If the above two methods are not supported, all pages are perpetually marked
+dirty by QEMU.
+
+By default, dirty pages are tracked during pre-copy as well as stop-and-copy
+phase. So, a page marked as dirty will be copied to the destination in both
+phases. Copying dirty pages in pre-copy phase helps QEMU to predict if it can
+achieve its downtime tolerances. If QEMU during pre-copy phase keeps finding
+dirty pages continuously, then it understands that even in stop-and-copy phase,
+it is likely to find dirty pages and can predict the downtime accordingly.
+
+QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
+which disables querying the dirty bitmap during pre-copy phase. If it is set to
+off, all dirty pages will be copied to the destination in stop-and-copy phase
+only.
+
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
+phase of migration. In that case, the unmap ioctl returns any dirty pages in
+that range and QEMU reports corresponding guest physical pages dirty. During
+stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
+pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those
+mapped ranges. If device dirty tracking is enabled with vIOMMU, live migration
+will be blocked.
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the state change flow during live migration for a VFIO device that
+supports both precopy and P2P migration. The flow for devices that don't
+support it is similar, except that the relevant states for precopy and P2P are
+skipped.
+The values in the parentheses represent the VM state, the migration state, and
+the VFIO device state, respectively.
+
+Live migration save path
+------------------------
+
+::
+
+ QEMU normal running state
+ (RUNNING, _NONE, _RUNNING)
+ |
+ migrate_init spawns migration_thread
+ Migration thread then calls each device's .save_setup()
+ (RUNNING, _SETUP, _PRE_COPY)
+ |
+ (RUNNING, _ACTIVE, _PRE_COPY)
+ If device is active, get pending_bytes by .state_pending_{estimate,exact}()
+ If total pending_bytes >= threshold_size, call .save_live_iterate()
+ Data of VFIO device for pre-copy phase is copied
+ Iterate till total pending bytes converge and are less than threshold
+ |
+ On migration completion, the vCPUs and the VFIO device are stopped
+ The VFIO device is first put in P2P quiescent state
+ (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P)
+ |
+ Then the VFIO device is put in _STOP_COPY state
+ (FINISH_MIGRATE, _ACTIVE, _STOP_COPY)
+ .save_live_complete_precopy() is called for each active device
+ For the VFIO device, iterate in .save_live_complete_precopy() until
+ pending data is 0
+ |
+ (POSTMIGRATE, _COMPLETED, _STOP_COPY)
+ Migration thread schedules cleanup bottom half and exits
+ |
+ .save_cleanup() is called
+ (POSTMIGRATE, _COMPLETED, _STOP)
+
+Live migration resume path
+--------------------------
+
+::
+
+ Incoming migration calls .load_setup() for each device
+ (RESTORE_VM, _ACTIVE, _STOP)
+ |
+ For each device, .load_state() is called for that device section data
+ (RESTORE_VM, _ACTIVE, _RESUMING)
+ |
+ At the end, .load_cleanup() is called for each device and vCPUs are started
+ The VFIO device is first put in P2P quiescent state
+ (RUNNING, _ACTIVE, _RUNNING_P2P)
+ |
+ (RUNNING, _NONE, _RUNNING)
+
+Postcopy
+========
+
+Postcopy migration is currently not supported for VFIO devices.
diff --git a/docs/devel/migration/virtio.rst b/docs/devel/migration/virtio.rst
new file mode 100644
index 0000000000..611a18b821
--- /dev/null
+++ b/docs/devel/migration/virtio.rst
@@ -0,0 +1,115 @@
+=======================
+Virtio device migration
+=======================
+
+Copyright 2015 IBM Corp.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+Saving and restoring the state of virtio devices is a bit of a twisty maze,
+for several reasons:
+
+- state is distributed between several parts:
+
+ - virtio core, for common fields like features, number of queues, ...
+
+ - virtio transport (pci, ccw, ...), for the different proxy devices and
+ transport specific state (msix vectors, indicators, ...)
+
+ - virtio device (net, blk, ...), for the different device types and their
+ state (mac address, request queue, ...)
+
+- most fields are saved via the stream interface; subsequently, subsections
+ have been added to make cross-version migration possible
+
+This file attempts to document the current procedure and point out some
+caveats.
+
+Save state procedure
+====================
+
+::
+
+ virtio core virtio transport virtio device
+ ----------- ---------------- -------------
+
+ save() function registered
+ via VMState wrapper on
+ device class
+ virtio_save() <----------
+ ------> save_config()
+ - save proxy device
+ - save transport-specific
+ device fields
+ - save common device
+ fields
+ - save common virtqueue
+ fields
+ ------> save_queue()
+ - save transport-specific
+ virtqueue fields
+ ------> save_device()
+ - save device-specific
+ fields
+ - save subsections
+ - device endianness,
+ if changed from
+ default endianness
+ - 64 bit features, if
+ any high feature bit
+ is set
+ - virtio-1 virtqueue
+ fields, if VERSION_1
+ is set
+
+Load state procedure
+====================
+
+::
+
+ virtio core virtio transport virtio device
+ ----------- ---------------- -------------
+
+ load() function registered
+ via VMState wrapper on
+ device class
+ virtio_load() <----------
+ ------> load_config()
+ - load proxy device
+ - load transport-specific
+ device fields
+ - load common device
+ fields
+ - load common virtqueue
+ fields
+ ------> load_queue()
+ - load transport-specific
+ virtqueue fields
+ - notify guest
+ ------> load_device()
+ - load device-specific
+ fields
+ - load subsections
+ - device endianness
+ - 64 bit features
+ - virtio-1 virtqueue
+ fields
+ - sanitize endianness
+ - sanitize features
+ - virtqueue index sanity
+ check
+ - feature-dependent setup
+
+Implications of this setup
+==========================
+
+Devices need to be careful in their state processing during load: The
+load_device() procedure is invoked by the core before subsections have
+been loaded. Any code that depends on information transmitted in subsections
+therefore has to be invoked in the device's load() function _after_
+virtio_load() has returned (e.g. code depending on features).
+
+Any extension of the state being migrated should be done in subsections
+added to the core for compatibility reasons. If transport or device specific
+state is added, core needs to invoke a callback from the new subsection.
diff --git a/docs/devel/modules.rst b/docs/devel/modules.rst
index 066f347b89..8e999c4fa4 100644
--- a/docs/devel/modules.rst
+++ b/docs/devel/modules.rst
@@ -1,5 +1,5 @@
============
-Qemu modules
+QEMU modules
============
.. kernel-doc:: include/qemu/module.h
diff --git a/docs/devel/multi-process.rst b/docs/devel/multi-process.rst
index 69699329d6..4ef539c0b0 100644
--- a/docs/devel/multi-process.rst
+++ b/docs/devel/multi-process.rst
@@ -1,15 +1,17 @@
-This is the design document for multi-process QEMU. It does not
-necessarily reflect the status of the current implementation, which
-may lack features or be considerably different from what is described
-in this document. This document is still useful as a description of
-the goals and general direction of this feature.
-
-Please refer to the following wiki for latest details:
-https://wiki.qemu.org/Features/MultiProcessQEMU
-
Multi-process QEMU
===================
+.. note::
+
+ This is the design document for multi-process QEMU. It does not
+ necessarily reflect the status of the current implementation, which
+ may lack features or be considerably different from what is described
+ in this document. This document is still useful as a description of
+ the goals and general direction of this feature.
+
+ Please refer to the following wiki for latest details:
+ https://wiki.qemu.org/Features/MultiProcessQEMU
+
QEMU is often used as the hypervisor for virtual machines running in the
Oracle cloud. Since one of the advantages of cloud computing is the
ability to run many VMs from different tenants in the same cloud
@@ -185,9 +187,9 @@ desired, in which the emulation application should only be allowed to
access the files or devices the VM it's running on behalf of can access.
#### qemu-io model
-Qemu-io is a test harness used to test changes to the QEMU block backend
-object code. (e.g., the code that implements disk images for disk driver
-emulation) Qemu-io is not a device emulation application per se, but it
+``qemu-io`` is a test harness used to test changes to the QEMU block backend
+object code (e.g., the code that implements disk images for disk driver
+emulation). ``qemu-io`` is not a device emulation application per se, but it
does compile the QEMU block objects into a separate binary from the main
QEMU one. This could be useful for disk device emulation, since its
emulation applications will need to include the QEMU block objects.
@@ -407,8 +409,9 @@ the initial messages sent to the emulation process is a guest memory
table. Each entry in this table consists of a file descriptor and size
that the emulation process can ``mmap()`` to directly access guest
memory, similar to ``vhost_user_set_mem_table()``. Note guest memory
-must be backed by file descriptors, such as when QEMU is given the
-*-mem-path* command line option.
+must be backed by shared file-backed memory, for example, using
+*-object memory-backend-file,share=on* and setting that memory backend
+as RAM for the machine.
IOMMU operations
^^^^^^^^^^^^^^^^
@@ -639,7 +642,7 @@ the CPU that issued the MMIO.
+==========+========================+
| rid | range MMIO is within |
+----------+------------------------+
-| offset | offset withing *rid* |
+| offset | offset within *rid* |
+----------+------------------------+
| type | e.g., load or store |
+----------+------------------------+
diff --git a/docs/devel/multi-thread-tcg.rst b/docs/devel/multi-thread-tcg.rst
index 5b446ee08b..1420789fff 100644
--- a/docs/devel/multi-thread-tcg.rst
+++ b/docs/devel/multi-thread-tcg.rst
@@ -109,6 +109,7 @@ including:
- debugging operations (breakpoint insertion/removal)
- some CPU helper functions
- linux-user spawning its first thread
+ - operations related to TCG Plugins
This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
@@ -226,10 +227,9 @@ instruction. This could be a future optimisation.
Emulated hardware state
-----------------------
-Currently thanks to KVM work any access to IO memory is automatically
-protected by the global iothread mutex, also known as the BQL (Big
-Qemu Lock). Any IO region that doesn't use global mutex is expected to
-do its own locking.
+Currently thanks to KVM work any access to IO memory is automatically protected
+by the BQL (Big QEMU Lock). Any IO region that doesn't use the BQL is expected
+to do its own locking.
However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model specific registers that
diff --git a/docs/devel/multiple-iothreads.txt b/docs/devel/multiple-iothreads.txt
index aeb997bed5..de85767b12 100644
--- a/docs/devel/multiple-iothreads.txt
+++ b/docs/devel/multiple-iothreads.txt
@@ -5,7 +5,7 @@ the COPYING file in the top-level directory.
This document explains the IOThread feature and how to write code that runs
-outside the QEMU global mutex.
+outside the BQL.
The main loop and IOThreads
---------------------------
@@ -29,13 +29,13 @@ scalability bottleneck on hosts with many CPUs. Work can be spread across
several IOThreads instead of just one main loop. When set up correctly this
can improve I/O latency and reduce jitter seen by the guest.
-The main loop is also deeply associated with the QEMU global mutex, which is a
-scalability bottleneck in itself. vCPU threads and the main loop use the QEMU
-global mutex to serialize execution of QEMU code. This mutex is necessary
-because a lot of QEMU's code historically was not thread-safe.
+The main loop is also deeply associated with the BQL, which is a
+scalability bottleneck in itself. vCPU threads and the main loop use the BQL
+to serialize execution of QEMU code. This mutex is necessary because a lot of
+QEMU's code historically was not thread-safe.
The fact that all I/O processing is done in a single main loop and that the
-QEMU global mutex is contended by all vCPU threads and the main loop explain
+BQL is contended by all vCPU threads and the main loop explain
why it is desirable to place work into IOThreads.
The experimental virtio-blk data-plane implementation has been benchmarked and
@@ -61,19 +61,26 @@ There are several old APIs that use the main loop AioContext:
* LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
* LEGACY timer_new_ms() - create a timer
* LEGACY qemu_bh_new() - create a BH
+ * LEGACY qemu_bh_new_guarded() - create a BH with a device re-entrancy guard
* LEGACY qemu_aio_wait() - run an event loop iteration
Since they implicitly work on the main loop they cannot be used in code that
runs in an IOThread. They might cause a crash or deadlock if called from an
-IOThread since the QEMU global mutex is not held.
+IOThread since the BQL is not held.
Instead, use the AioContext functions directly (see include/block/aio.h):
* aio_set_fd_handler() - monitor a file descriptor
* aio_set_event_notifier() - monitor an event notifier
* aio_timer_new() - create a timer
* aio_bh_new() - create a BH
+ * aio_bh_new_guarded() - create a BH with a device re-entrancy guard
* aio_poll() - run an event loop iteration
+The qemu_bh_new_guarded/aio_bh_new_guarded APIs accept a "MemReentrancyGuard"
+argument, which is used to check for and prevent re-entrancy problems. For
+BHs associated with devices, the reentrancy-guard is contained in the
+corresponding DeviceState and named "mem_reentrancy_guard".
+
The AioContext can be obtained from the IOThread using
iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
Code that takes an AioContext argument works both in IOThreads or the main
@@ -81,27 +88,18 @@ loop, depending on which AioContext instance the caller passes in.
How to synchronize with an IOThread
-----------------------------------
-AioContext is not thread-safe so some rules must be followed when using file
-descriptors, event notifiers, timers, or BHs across threads:
-
-1. AioContext functions can always be called safely. They handle their
-own locking internally.
-
-2. Other threads wishing to access the AioContext must use
-aio_context_acquire()/aio_context_release() for mutual exclusion. Once the
-context is acquired no other thread can access it or run event loop iterations
-in this AioContext.
-
-Legacy code sometimes nests aio_context_acquire()/aio_context_release() calls.
-Do not use nesting anymore, it is incompatible with the BDRV_POLL_WHILE() macro
-used in the block layer and can lead to hangs.
+Variables that can be accessed by multiple threads require some form of
+synchronization such as qemu_mutex_lock(), rcu_read_lock(), etc.
-There is currently no lock ordering rule if a thread needs to acquire multiple
-AioContexts simultaneously. Therefore, it is only safe for code holding the
-QEMU global mutex to acquire other AioContexts.
+AioContext functions like aio_set_fd_handler(), aio_set_event_notifier(),
+aio_bh_new(), and aio_timer_new() are thread-safe. They can be used to trigger
+activity in an IOThread.
Side note: the best way to schedule a function call across threads is to call
-aio_bh_schedule_oneshot(). No acquire/release or locking is needed.
+aio_bh_schedule_oneshot().
+
+The main loop thread can wait synchronously for a condition using
+AIO_WAIT_WHILE().
AioContext and the block layer
------------------------------
@@ -109,7 +107,7 @@ The AioContext originates from the QEMU block layer, even though nowadays
AioContext is a generic event loop that can be used by any QEMU subsystem.
The block layer has support for AioContext integrated. Each BlockDriverState
-is associated with an AioContext using bdrv_try_set_aio_context() and
+is associated with an AioContext using bdrv_try_change_aio_context() and
bdrv_get_aio_context(). This allows block layer code to process I/O inside the
right AioContext. Other subsystems may wish to follow a similar approach.
@@ -117,22 +115,16 @@ Block layer code must therefore expect to run in an IOThread and avoid using
old APIs that implicitly use the main loop. See the "How to program for
IOThreads" above for information on how to do that.
-If main loop code such as a QMP function wishes to access a BlockDriverState
-it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
-that callbacks in the IOThread do not run in parallel.
-
Code running in the monitor typically needs to ensure that past
requests from the guest are completed. When a block device is running
in an IOThread, the IOThread can also process requests from the guest
(via ioeventfd). To achieve both objects, wrap the code between
bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
-section". The functions must be called between aio_context_acquire()
-and aio_context_release(). You can freely release and re-acquire the
-AioContext within a drained section.
-
-Long-running jobs (usually in the form of coroutines) are best scheduled in
-the BlockDriverState's AioContext to avoid the need to acquire/release around
-each bdrv_*() call. The functions bdrv_add/remove_aio_context_notifier,
-or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends,
-can be used to get a notification whenever bdrv_try_set_aio_context() moves a
+section".
+
+Long-running jobs (usually in the form of coroutines) are often scheduled in
+the BlockDriverState's AioContext. The functions
+bdrv_add/remove_aio_context_notifier, or alternatively
+blk_add/remove_aio_context_notifier if you use BlockBackends, can be used to
+get a notification whenever bdrv_try_change_aio_context() moves a
BlockDriverState to a different AioContext.
diff --git a/docs/devel/nested-papr.txt b/docs/devel/nested-papr.txt
new file mode 100644
index 0000000000..90943650db
--- /dev/null
+++ b/docs/devel/nested-papr.txt
@@ -0,0 +1,119 @@
+Nested PAPR API (aka KVM on PowerVM)
+====================================
+
+This API aims at providing support to enable nested virtualization with
+KVM on PowerVM. The existing support for nested KVM on PowerNV was
+introduced with the cap-nested-hv option; with a slight design change,
+a new cap-nested-papr option is added to enable this on papr/pseries, e.g.:
+
+ qemu-system-ppc64 -cpu POWER10 -machine pseries,cap-nested-papr=true ...
+
+Work by:
+ Michael Neuling <mikey@neuling.org>
+ Vaibhav Jain <vaibhav@linux.ibm.com>
+ Jordan Niethe <jniethe5@gmail.com>
+ Harsh Prateek Bora <harshpb@linux.ibm.com>
+ Shivaprasad G Bhat <sbhat@linux.ibm.com>
+ Kautuk Consul <kconsul@linux.vnet.ibm.com>
+
+The text below is taken from the kernel documentation:
+
+Introduction
+============
+
+This document explains how a guest operating system can act as a
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor. A significant design change with
+respect to the existing API is that the entire L2 state is now maintained within L0.
+
+Existing Nested-HV API
+======================
+
+Linux/KVM has had support for nesting as an L0 or L1 since 2018.
+
+The L0 code was added::
+
+ commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+ Author: Paul Mackerras <paulus@ozlabs.org>
+ Date: Mon Oct 8 16:31:03 2018 +1100
+ KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+ commit 360cae313702cdd0b90f82c261a8302fecef030a
+ Author: Paul Mackerras <paulus@ozlabs.org>
+ Date: Mon Oct 8 16:31:04 2018 +1100
+ KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall h_enter_nested(). This
+call is made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs until an L2 exit condition
+is reached. Once the L2 exits, the state of the L2 is given back to
+the L1 by the L0. The full L2 vCPU state is always transferred from
+and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
+vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
+-> L1 exit).
+
+The only state kept by the L0 is the partition table. The L1 registers
+its partition table using the h_set_partition_table() hcall. All
+other state held by the L0 about the L2s is cached state (such as
+shadow page tables).
+
+The L1 may run any L2 or vCPU without first informing the L0. It
+simply starts the vCPU using h_enter_nested(). The creation of L2s and
+vCPUs is done implicitly whenever h_enter_nested() is called.
+
+In this document, we call this existing API the v1 API.
+
+New PAPR API
+===============
+
+The new PAPR API changes from the v1 API such that the creation of the L2 and
+associated vCPUs is explicit. In this document, we call this the v2
+API.
+
+h_enter_nested() is replaced with H_GUEST_RUN_VCPU(). Before this can
+be called the L1 must explicitly create the L2 using h_guest_create()
+and any associated vCPUs using h_guest_create_vCPU(). Getting
+and setting vCPU state can also be performed using the h_guest_{g|s}et
+hcall.
+
+The basic execution flow for an L1 to create an L2, run it, and
+delete it is:
+
+- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
+ (normally at L1 boot time).
+
+- L1 requests the L0 to create an L2 with H_GUEST_CREATE() and receives a token
+
+- L1 requests the L0 to create an L2 vCPU with H_GUEST_CREATE_VCPU()
+
+- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET() hcall
+
+- L1 requests the L0 to run the vCPU using H_GUEST_RUN_VCPU() hcall
+
+- L1 deletes L2 with H_GUEST_DELETE()
+
+For more details, please refer to:
+
+[1] Linux Kernel documentation (upstream documentation commit):
+
+commit 476652297f94a2e5e5ef29e734b0da37ade94110
+Author: Michael Neuling <mikey@neuling.org>
+Date: Thu Sep 14 13:06:00 2023 +1000
+
+ docs: powerpc: Document nested KVM on POWER
+
+ Document support for nested KVM on POWER using the existing API as well
+ as the new PAPR API. This includes the new HCALL interface and how it
+ used by KVM.
+
+ Signed-off-by: Michael Neuling <mikey@neuling.org>
+ Signed-off-by: Jordan Niethe <jniethe5@gmail.com>
+ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
+ Link: https://msgid.link/20230914030600.16993-12-jniethe5@gmail.com
diff --git a/docs/devel/pci.rst b/docs/devel/pci.rst
new file mode 100644
index 0000000000..68739334f3
--- /dev/null
+++ b/docs/devel/pci.rst
@@ -0,0 +1,8 @@
+=============
+PCI subsystem
+=============
+
+API Reference
+-------------
+
+.. kernel-doc:: include/hw/pci/pci.h
diff --git a/docs/devel/qapi-code-gen.rst b/docs/devel/qapi-code-gen.rst
index b2569de486..f453bd3546 100644
--- a/docs/devel/qapi-code-gen.rst
+++ b/docs/devel/qapi-code-gen.rst
@@ -41,8 +41,8 @@ used internally.
There are several kinds of types: simple types (a number of built-in
types, such as ``int`` and ``str``; as well as enumerations), arrays,
-complex types (structs and two flavors of unions), and alternate types
-(a choice between other types).
+complex types (structs and unions), and alternate types (a choice
+between other types).
Schema syntax
@@ -167,6 +167,7 @@ Syntax::
'*doc-required': BOOL,
'*command-name-exceptions': [ STRING, ... ],
'*command-returns-exceptions': [ STRING, ... ],
+ '*documentation-exceptions': [ STRING, ... ],
'*member-name-exceptions': [ STRING, ... ] } }
The pragma directive lets you control optional generator behavior.
@@ -183,6 +184,10 @@ may contain ``"_"`` instead of ``"-"``. Default is none.
Pragma 'command-returns-exceptions' takes a list of commands that may
violate the rules on permitted return types. Default is none.
+Pragma 'documentation-exceptions' takes a list of types, commands, and
+events whose members / arguments need not be documented. Default is
+none.
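+
+As a minimal sketch, a schema that lists two definitions whose members
+or arguments need not be documented might contain the following. The
+names ``MyLegacyStruct`` and ``my-legacy-command`` are hypothetical and
+serve only as illustration::
+
+    { 'pragma': {
+        'documentation-exceptions': [
+            'MyLegacyStruct',
+            'my-legacy-command' ] } }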
+
Pragma 'member-name-exceptions' takes a list of types whose member
names may contain uppercase letters, and ``"_"`` instead of ``"-"``.
Default is none.
@@ -200,7 +205,9 @@ Syntax::
'*if': COND,
'*features': FEATURES }
ENUM-VALUE = STRING
- | { 'name': STRING, '*if': COND }
+ | { 'name': STRING,
+ '*if': COND,
+ '*features': FEATURES }
Member 'enum' names the enum type.
@@ -543,7 +550,8 @@ Member 'allow-oob' declares whether the command supports out-of-band
{ 'command': 'migrate_recover',
'data': { 'uri': 'str' }, 'allow-oob': true }
-See qmp-spec.txt for out-of-band execution syntax and semantics.
+See the :doc:`/interop/qmp-spec` for out-of-band execution syntax
+and semantics.
Commands supporting out-of-band execution can still be executed
in-band.
@@ -591,7 +599,7 @@ blocking the guest and other background operations.
Coroutine safety can be hard to prove, similar to thread safety. Common
pitfalls are:
-- The global mutex isn't held across ``qemu_coroutine_yield()``, so
+- The BQL isn't held across ``qemu_coroutine_yield()``, so
operations that used to assume that they execute atomically may have
to be more careful to protect against changes in the global state.
@@ -683,9 +691,10 @@ change in the QMP syntax (usually by allowing values or operations
that previously resulted in an error). QMP clients may still need to
know whether the extension is available.
-For this purpose, a list of features can be specified for a command or
-struct type. Each list member can either be ``{ 'name': STRING, '*if':
-COND }``, or STRING, which is shorthand for ``{ 'name': STRING }``.
+For this purpose, a list of features can be specified for definitions,
+enumeration values, and struct members. Each feature list member can
+either be ``{ 'name': STRING, '*if': COND }``, or STRING, which is
+shorthand for ``{ 'name': STRING }``.
The optional 'if' member specifies a conditional. See `Configuring
the schema`_ below for more on this.
@@ -706,8 +715,14 @@ QEMU shows a certain behaviour.
Special features
~~~~~~~~~~~~~~~~
-Feature "deprecated" marks a command, event, or struct member as
-deprecated. It is not supported elsewhere so far.
+Feature "deprecated" marks a command, event, enum value, or struct
+member as deprecated. It is not supported elsewhere so far.
+Interfaces so marked may be withdrawn in future releases in accordance
+with QEMU's deprecation policy.
+
+Feature "unstable" marks a command, event, enum value, or struct
+member as unstable. It is not supported elsewhere so far. Interfaces
+so marked may be withdrawn or changed incompatibly in future releases.
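+
+The following sketch shows where special features could be attached.
+All names (``MyEnum``, ``MyStruct``, ``my-command`` and their members)
+are hypothetical, and documentation comments are omitted for brevity::
+
+    { 'enum': 'MyEnum',
+      'data': [ 'value1',
+                { 'name': 'value2', 'features': [ 'deprecated' ] } ] }
+
+    { 'struct': 'MyStruct',
+      'data': { 'stable-member': 'int',
+                'new-member': { 'type': 'str',
+                                'features': [ 'unstable' ] } } }
+
+    { 'command': 'my-command',
+      'data': { 'arg': 'MyStruct' },
+      'features': [ 'unstable' ] }
+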
Naming rules and reserved names
@@ -727,14 +742,14 @@ Types, commands, and events share a common namespace. Therefore,
generally speaking, type definitions should always use CamelCase for
user-defined type names, while built-in types are lowercase.
-Type names ending with ``Kind`` or ``List`` are reserved for the
-generator, which uses them for implicit union enums and array types,
-respectively.
+Type names ending with ``List`` are reserved for the generator, which
+uses them for array types.
-Command names, and member names within a type, should be all lower
-case with words separated by a hyphen. However, some existing older
-commands and complex types use underscore; when extending them,
-consistency is preferred over blindly avoiding underscore.
+Command names, member names within a type, and feature names should be
+all lower case with words separated by a hyphen. However, some
+existing older commands and complex types use underscore; when
+extending them, consistency is preferred over blindly avoiding
+underscore.
Event names should be ALL_CAPS with words separated by underscore.
@@ -742,9 +757,8 @@ Member name ``u`` and names starting with ``has-`` or ``has_`` are reserved
for the generator, which uses them for unions and for tracking
optional members.
-Any name (command, event, type, member, or enum value) beginning with
-``x-`` is marked experimental, and may be withdrawn or changed
-incompatibly in a future release.
+Names beginning with ``x-`` used to signify "experimental". This
+convention has been replaced by special feature "unstable".
Pragmas ``command-name-exceptions`` and ``member-name-exceptions`` let
you violate naming rules. Use for new code is strongly discouraged. See
@@ -796,9 +810,8 @@ gets its generated code guarded like this::
... generated code ...
#endif /* defined(HAVE_BAR) && defined(CONFIG_FOO) */
-Individual members of complex types, commands arguments, and
-event-specific data can also be made conditional. This requires the
-longhand form of MEMBER.
+Individual members of complex types can also be made conditional.
+This requires the longhand form of MEMBER.
Example: a struct type with unconditional member 'foo' and conditional
member 'bar' ::
@@ -809,8 +822,8 @@ member 'bar' ::
A union's discriminator may not be conditional.
-Likewise, individual enumeration values be conditional. This requires
-the longhand form of ENUM-VALUE_.
+Likewise, individual enumeration values may be conditional. This
+requires the longhand form of ENUM-VALUE_.
Example: an enum type with unconditional value 'foo' and conditional
value 'bar' ::
@@ -916,14 +929,17 @@ first character of the first line.
The usual ****strong****, *\*emphasized\** and ````literal```` markup
should be used. If you need a single literal ``*``, you will need to
-backslash-escape it. As an extension beyond the usual rST syntax, you
-can also use ``@foo`` to reference a name in the schema; this is rendered
-the same way as ````foo````.
+backslash-escape it.
+
+Use ``@foo`` to reference a name in the schema. This is an rST
+extension. It is rendered the same way as ````foo````, but carries
+additional meaning.
Example::
##
# Some text foo with **bold** and *emphasis*
+ #
# 1. with a list
# 2. like that
#
@@ -936,6 +952,11 @@ Example::
# <- get that
##
+For legibility, wrap text paragraphs so every line is at most 70
+characters long.
+
+Separate sentences with two spaces.
+
Definition documentation
~~~~~~~~~~~~~~~~~~~~~~~~
@@ -949,59 +970,54 @@ definition must have documentation.
Definition documentation starts with a line naming the definition,
followed by an optional overview, a description of each argument (for
commands and events), member (for structs and unions), branch (for
-alternates), or value (for enums), and finally optional tagged
-sections.
-
-Descriptions of arguments can span multiple lines. The description
-text can start on the line following the '\@argname:', in which case it
-must not be indented at all. It can also start on the same line as
-the '\@argname:'. In this case if it spans multiple lines then second
-and subsequent lines must be indented to line up with the first
-character of the first line of the description::
-
- # @argone:
- # This is a two line description
- # in the first style.
- #
- # @argtwo: This is a two line description
- # in the second style.
+alternates), or value (for enums), a description of each feature (if
+any), and finally optional tagged sections.
-The number of spaces between the ':' and the text is not significant.
+Descriptions start with '\@name:'. The description text must be
+indented like this::
-.. admonition:: FIXME
+ # @name: Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed
+ # do eiusmod tempor incididunt ut labore et dolore magna aliqua.
- The parser accepts these things in almost any order.
+.. FIXME The parser accepts these things in almost any order.
-.. admonition:: FIXME
-
- union branches should be described, too.
+.. FIXME union branches should be described, too.
Extensions added after the definition was first released carry a
-'(since x.y.z)' comment.
+"(since x.y.z)" comment.
-A tagged section starts with one of the following words:
-"Note:"/"Notes:", "Since:", "Example"/"Examples", "Returns:", "TODO:".
-The section ends with the start of a new section.
+The feature descriptions must be preceded by a blank line and then a
+line "Features:", like this::
-The text of a section can start on a new line, in
-which case it must not be indented at all. It can also start
-on the same line as the 'Note:', 'Returns:', etc tag. In this
-case if it spans multiple lines then second and subsequent
-lines must be indented to match the first, in the same way as
-multiline argument descriptions.
+ #
+ # Features:
+ #
+ # @feature: Description text
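+
+Putting this together, a documented command with a feature might look
+roughly like the sketch below. The command ``my-command``, its
+argument, and the release number are hypothetical::
+
+    ##
+    # @my-command:
+    #
+    # Frobnicate the widget.
+    #
+    # @widget: Name of the widget to frobnicate.
+    #
+    # Features:
+    #
+    # @unstable: This command is experimental.
+    #
+    # Since: 9.0
+    ##
+    { 'command': 'my-command',
+      'data': { 'widget': 'str' },
+      'features': [ 'unstable' ] }
+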
-A 'Since: x.y.z' tagged section lists the release that introduced the
-definition.
+A tagged section begins with a paragraph that starts with one of the
+following words: "Note:"/"Notes:", "Since:", "Example:"/"Examples:",
+"Returns:", "Errors:", "TODO:". It ends with the start of a new
+section.
-The text of a section can start on a new line, in
-which case it must not be indented at all. It can also start
-on the same line as the 'Note:', 'Returns:', etc tag. In this
-case if it spans multiple lines then second and subsequent
-lines must be indented to match the first.
+The second and subsequent lines of tagged sections must be indented
+like this::
-An 'Example' or 'Examples' section is automatically rendered
-entirely as literal fixed-width text. In other sections,
-the text is formatted, and rST markup can be used.
+ # Note: Ut enim ad minim veniam, quis nostrud exercitation ullamco
+ # laboris nisi ut aliquip ex ea commodo consequat.
+ #
+ # Duis aute irure dolor in reprehenderit in voluptate velit esse
+ # cillum dolore eu fugiat nulla pariatur.
+
+"Returns" and "Errors" sections are only valid for commands. They
+document the success and the error response, respectively.
+
+A "Since: x.y.z" tagged section lists the release that introduced the
+definition.
+
+An "Example" or "Examples" section is rendered entirely
+as literal fixed-width text. "TODO" sections are not rendered at all
+(they are for developers, not users of QMP). In other sections, the
+text is formatted, and rST markup can be used.
For example::
@@ -1011,13 +1027,13 @@ For example::
# Statistics of a virtual block device or a block backing device.
#
# @device: If the stats are for a virtual block device, the name
- # corresponding to the virtual block device.
+ # corresponding to the virtual block device.
#
- # @node-name: The node name of the device. (since 2.3)
+ # @node-name: The node name of the device. (Since 2.3)
#
# ... more members ...
#
- # Since: 0.14.0
+ # Since: 0.14
##
{ 'struct': 'BlockStats',
'data': {'*device': 'str', '*node-name': 'str',
@@ -1028,26 +1044,85 @@ For example::
#
# Query the @BlockStats for all virtual block devices.
#
- # @query-nodes: If true, the command will query all the
- # block nodes ... explain, explain ... (since 2.3)
+ # @query-nodes: If true, the command will query all the block nodes
+ # ... explain, explain ...
+ # (Since 2.3)
#
# Returns: A list of @BlockStats for each virtual block devices.
#
- # Since: 0.14.0
+ # Since: 0.14
#
# Example:
#
- # -> { "execute": "query-blockstats" }
- # <- {
- # ... lots of output ...
- # }
- #
+ # -> { "execute": "query-blockstats" }
+ # <- {
+ # ... lots of output ...
+ # }
##
{ 'command': 'query-blockstats',
'data': { '*query-nodes': 'bool' },
'returns': ['BlockStats'] }
+Markup pitfalls
+~~~~~~~~~~~~~~~
+
+A blank line is required between list items and paragraphs. Without
+it, the list may not be recognized, resulting in garbled output. Good
+example::
+
+ # An event's state is modified if:
+ #
+ # - its name matches the @name pattern, and
+ # - if @vcpu is given, the event has the "vcpu" property.
+
+Without the blank line this would be a single paragraph.
+
+Indentation matters. Bad example::
+
+ # @none: None (no memory side cache in this proximity domain,
+ # or cache associativity unknown)
+ # (since 5.0)
+
+The last line's de-indent is wrong. The second and subsequent lines
+need to line up with each other, like this::
+
+ # @none: None (no memory side cache in this proximity domain,
+ # or cache associativity unknown)
+ # (since 5.0)
+
+Section tags are case-sensitive and end with a colon. They are only
+recognized after a blank line. Good example::
+
+ #
+ # Since: 7.1
+
+Bad examples (all ordinary paragraphs)::
+
+ # since: 7.1
+
+ # Since 7.1
+
+ # Since : 7.1
+
+Likewise, member descriptions require a colon. Good example::
+
+ # @interface-id: Interface ID
+
+Bad examples (all ordinary paragraphs)::
+
+ # @interface-id Interface ID
+
+ # @interface-id : Interface ID
+
+Undocumented members are not flagged, yet. Instead, the generated
+documentation describes them as "Not documented". Think twice before
+adding more undocumented members.
+
+When you change documentation comments, please check that the
+generated documentation comes out as intended!
+
+
Client JSON Protocol introspection
==================================
@@ -1148,16 +1223,16 @@ Example: the SchemaInfo for EVENT_C from section Events_ ::
Type "q_obj-EVENT_C-arg" is an implicitly defined object type with
the two members from the event's definition.
-The SchemaInfo for struct and union types has meta-type "object".
-
-The SchemaInfo for a struct type has variant member "members".
+The SchemaInfo for struct and union types has meta-type "object" and
+variant member "members".
The SchemaInfo for a union type additionally has variant members "tag"
and "variants".
"members" is a JSON array describing the object's common members, if
any. Each element is a JSON object with members "name" (the member's
-name), "type" (the name of its type), and optionally "default". The
+name), "type" (the name of its type), "features" (a JSON array of
+feature strings), and "default". The latter two are optional. The
member is optional if "default" is present. Currently, "default" can
only have value null. Other values are reserved for future
extensions. The "members" array is in no particular order; clients
@@ -1231,14 +1306,22 @@ Example: the SchemaInfo for ['str'] ::
"element-type": "str" }
The SchemaInfo for an enumeration type has meta-type "enum" and
-variant member "values". The values are listed in no particular
-order; clients must search the entire enum when learning whether a
-particular value is supported.
+variant member "members".
+
+"members" is a JSON array describing the enumeration values. Each
+element is a JSON object with member "name" (the member's name), and
+optionally "features" (a JSON array of feature strings). The
+"members" array is in no particular order; clients must search the
+entire array when learning whether a particular value is supported.
Example: the SchemaInfo for MyEnum from section `Enumeration types`_ ::
{ "name": "MyEnum", "meta-type": "enum",
- "values": [ "value1", "value2", "value3" ] }
+ "members": [
+ { "name": "value1" },
+ { "name": "value2" },
+ { "name": "value3" }
+ ] }
The SchemaInfo for a built-in type has the same name as the type in
the QAPI schema (see section `Built-in Types`_), with one exception
@@ -1339,7 +1422,7 @@ qmp_my_command(); everything else is produced by the generator. ::
$ cat example-schema.json
{ 'struct': 'UserDefOne',
- 'data': { 'integer': 'int', '*string': 'str' } }
+ 'data': { 'integer': 'int', '*string': 'str', '*flag': 'bool' } }
{ 'command': 'my-command',
'data': { 'arg1': ['UserDefOne'] },
@@ -1392,8 +1475,9 @@ Example::
struct UserDefOne {
int64_t integer;
- bool has_string;
char *string;
+ bool has_flag;
+ bool flag;
};
void qapi_free_UserDefOne(UserDefOne *obj);
@@ -1505,14 +1589,21 @@ Example::
bool visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp)
{
+ bool has_string = !!obj->string;
+
if (!visit_type_int(v, "integer", &obj->integer, errp)) {
return false;
}
- if (visit_optional(v, "string", &obj->has_string)) {
+ if (visit_optional(v, "string", &has_string)) {
if (!visit_type_str(v, "string", &obj->string, errp)) {
return false;
}
}
+ if (visit_optional(v, "flag", &obj->has_flag)) {
+ if (!visit_type_bool(v, "flag", &obj->flag, errp)) {
+ return false;
+ }
+ }
return true;
}
@@ -1613,6 +1704,9 @@ The following files are generated:
``$(prefix)qapi-commands.h``
Function prototypes for the QMP commands specified in the schema
+ ``$(prefix)qapi-commands.trace-events``
+ Trace event declarations, see :ref:`tracing`.
+
``$(prefix)qapi-init-commands.h``
Command initialization prototype
@@ -1633,10 +1727,16 @@ Example::
void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp);
#endif /* EXAMPLE_QAPI_COMMANDS_H */
+
+ $ cat qapi-generated/example-qapi-commands.trace-events
+ # AUTOMATICALLY GENERATED, DO NOT MODIFY
+
+ qmp_enter_my_command(const char *json) "%s"
+ qmp_exit_my_command(const char *result, bool succeeded) "%s %d"
+
$ cat qapi-generated/example-qapi-commands.c
[Uninteresting stuff omitted...]
-
static void qmp_marshal_output_UserDefOne(UserDefOne *ret_in,
QObject **ret_out, Error **errp)
{
@@ -1672,14 +1772,27 @@ Example::
goto out;
}
+ if (trace_event_get_state_backends(TRACE_QMP_ENTER_MY_COMMAND)) {
+ g_autoptr(GString) req_json = qobject_to_json(QOBJECT(args));
+
+ trace_qmp_enter_my_command(req_json->str);
+ }
+
retval = qmp_my_command(arg.arg1, &err);
- error_propagate(errp, err);
if (err) {
+ trace_qmp_exit_my_command(error_get_pretty(err), false);
+ error_propagate(errp, err);
goto out;
}
qmp_marshal_output_UserDefOne(retval, ret, errp);
+ if (trace_event_get_state_backends(TRACE_QMP_EXIT_MY_COMMAND)) {
+ g_autoptr(GString) ret_json = qobject_to_json(*ret);
+
+ trace_qmp_exit_my_command(ret_json->str, true);
+ }
+
out:
visit_free(v);
v = qapi_dealloc_visitor_new();
@@ -1707,7 +1820,7 @@ Example::
QTAILQ_INIT(cmds);
qmp_register_command(cmds, "my-command",
- qmp_marshal_my_command, QCO_NO_OPTIONS);
+ qmp_marshal_my_command, 0, 0);
}
[Uninteresting stuff omitted...]
@@ -1876,6 +1989,12 @@ Example::
{ "type", QLIT_QSTR("str"), },
{}
})),
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "default", QLIT_QNULL, },
+ { "name", QLIT_QSTR("flag"), },
+ { "type", QLIT_QSTR("bool"), },
+ {}
+ })),
{}
})), },
{ "meta-type", QLIT_QSTR("object"), },
@@ -1909,6 +2028,12 @@ Example::
{ "name", QLIT_QSTR("str"), },
{}
})),
+ QLIT_QDICT(((QLitDictEntry[]) {
+ { "json-type", QLIT_QSTR("boolean"), },
+ { "meta-type", QLIT_QSTR("builtin"), },
+ { "name", QLIT_QSTR("bool"), },
+ {}
+ })),
{}
}));
diff --git a/docs/devel/qdev-api.rst b/docs/devel/qdev-api.rst
new file mode 100644
index 0000000000..3f35eea025
--- /dev/null
+++ b/docs/devel/qdev-api.rst
@@ -0,0 +1,7 @@
+.. _qdev-api:
+
+================================
+QEMU Device (qdev) API Reference
+================================
+
+.. kernel-doc:: include/hw/qdev-core.h
diff --git a/docs/devel/qgraph.rst b/docs/devel/qgraph.rst
index c2882c3a33..43342d9d65 100644
--- a/docs/devel/qgraph.rst
+++ b/docs/devel/qgraph.rst
@@ -1,8 +1,7 @@
.. _qgraph:
-========================================
Qtest Driver Framework
-========================================
+======================
In order to test a specific driver, plain libqos tests need to
take care of booting QEMU with the right machine and devices.
@@ -15,7 +14,7 @@ support that device.
Using only libqos APIs, the test has to manually take care of
covering all the setups, and build the correct command line.
-This also introduces backward compability issues: if a device/driver command
+This also introduces backward compatibility issues: if a device/driver command
line name is changed, all tests that use that will not work
properly anymore and need to be adjusted.
@@ -31,13 +30,15 @@ so the sdhci-test should only care of linking its qgraph node with
that interface. In this way, if the command line of a sdhci driver
is changed, only the respective qgraph driver node has to be adjusted.
+QGraph concepts
+---------------
+
The graph is composed by nodes that represent machines, drivers, tests
and edges that define the relationships between them (``CONSUMES``, ``PRODUCES``, and
``CONTAINS``).
-
Nodes
-^^^^^^
+~~~~~
A node can be of four types:
@@ -64,7 +65,7 @@ Notes for the nodes:
drivers name, otherwise they won't be discovered
Edges
-^^^^^^
+~~~~~
An edge relation between two nodes (drivers or machines) ``X`` and ``Y`` can be:
@@ -73,7 +74,7 @@ An edge relation between two nodes (drivers or machines) ``X`` and ``Y`` can be:
- ``X CONTAINS Y``: ``Y`` is part of ``X`` component
Execution steps
-^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~
The basic framework steps are the following:
@@ -92,8 +93,64 @@ The basic framework steps are the following:
Depending on the QEMU binary used, only some drivers/machines will be
available and only test that are reached by them will be executed.
+Command line
+~~~~~~~~~~~~
+
+The command line is built from node names and optional arguments
+passed by the user when building the edges.
+
+There are three types of command line arguments:
+
+- ``in node`` : created from the node name. For example, machines will
+  have ``-M <machine>`` added to their command line, while devices will
+  have ``-device <device>``. This is done automatically by the framework.
+- ``after node`` : added as an additional argument after the node name.
+  This argument is optionally added when creating edges, by setting
+  the parameters ``after_cmd_line`` and ``extra_device_opts`` in
+  ``QOSGraphEdgeOptions``. The framework automatically adds a comma
+  before ``extra_device_opts``, because it appends attributes to the
+  destination node pointed to by the edge containing these options,
+  and automatically adds a space before ``after_cmd_line``, because
+  it adds an additional device, not an attribute.
+- ``before node`` : added as an additional argument before the node name.
+  This argument is optionally added when creating edges, by setting
+  the parameter ``before_cmd_line`` in ``QOSGraphEdgeOptions``. It
+  adds arguments before the destination node pointed to by the edge
+  containing these options. It is helpful for options that are not
+  node-representable, such as ``-fsdev`` or ``-netdev``.
+
+While command line arguments set on edges are always used, not all node
+names are used in every path walk: the contained or produced ones are
+already added by QEMU, so only nodes that "consume" will be used to
+build the command line. Also, nodes that have ``{ "abstract" : true }``
+as a QMP attribute will lose their command line, since they are not
+proper devices to be added in QEMU.
+
+Example::
+
+    QOSGraphEdgeOptions opts = {
+        .before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
+                           "file.read-zeroes=on,format=raw",
+        .after_cmd_line = "-device scsi-hd,bus=vs0.0,drive=drv0",
+        .extra_device_opts = "id=vs0",
+    };
+
+    qos_node_create_driver("virtio-scsi-device",
+                           virtio_scsi_device_create);
+    qos_node_consumes("virtio-scsi-device", "virtio-bus", &opts);
+
+This will produce the following command line:
+``-drive id=drv0,if=none,file=null-co://,file.read-zeroes=on,format=raw -device virtio-scsi-device,id=vs0 -device scsi-hd,bus=vs0.0,drive=drv0``
+
Troubleshooting unavailable tests
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
If there is no path from an available machine to a test then that test will be
unavailable and won't execute. This can happen if a test or driver did not set
up its qgraph node correctly. It can also happen if the necessary machine type
@@ -151,7 +208,7 @@ Typically this is because the QEMU binary lacks support for the necessary
machine type or device.
Creating a new driver and its interface
-"""""""""""""""""""""""""""""""""""""""""
+---------------------------------------
Here we continue the ``sdhci`` use case, with the following scenario:
@@ -489,7 +546,7 @@ or inverting the consumes edge in consumed_by::
arm/raspi2b --contains--> generic-sdhci
Adding a new test
-"""""""""""""""""
+-----------------
Given the above setup, adding a new test is very simple.
``sdhci-test``, taken from ``tests/qtest/sdhci-test.c``::
@@ -565,62 +622,7 @@ and for the binary ``QTEST_QEMU_BINARY=./qemu-system-arm``:
Additional examples are also in ``test-qgraph.c``
-Command line:
-""""""""""""""
-
-Command line is built by using node names and optional arguments
-passed by the user when building the edges.
-
-There are three types of command line arguments:
-
-- ``in node`` : created from the node name. For example, machines will
- have ``-M <machine>`` to its command line, while devices
- ``-device <device>``. It is automatically done by the framework.
-- ``after node`` : added as additional argument to the node name.
- This argument is added optionally when creating edges,
- by setting the parameter ``after_cmd_line`` and
- ``extra_edge_opts`` in ``QOSGraphEdgeOptions``.
- The framework automatically adds
- a comma before ``extra_edge_opts``,
- because it is going to add attributes
- after the destination node pointed by
- the edge containing these options, and automatically
- adds a space before ``after_cmd_line``, because it
- adds an additional device, not an attribute.
-- ``before node`` : added as additional argument to the node name.
- This argument is added optionally when creating edges,
- by setting the parameter ``before_cmd_line`` in
- ``QOSGraphEdgeOptions``. This attribute
- is going to add attributes before the destination node
- pointed by the edge containing these options. It is
- helpful to commands that are not node-representable,
- such as ``-fdsev`` or ``-netdev``.
-
-While adding command line in edges is always used, not all nodes names are
-used in every path walk: this is because the contained or produced ones
-are already added by QEMU, so only nodes that "consumes" will be used to
-build the command line. Also, nodes that will have ``{ "abstract" : true }``
-as QMP attribute will loose their command line, since they are not proper
-devices to be added in QEMU.
-
-Example::
-
- QOSGraphEdgeOptions opts = {
- .before_cmd_line = "-drive id=drv0,if=none,file=null-co://,"
- "file.read-zeroes=on,format=raw",
- .after_cmd_line = "-device scsi-hd,bus=vs0.0,drive=drv0",
-
- opts.extra_device_opts = "id=vs0";
- };
-
- qos_node_create_driver("virtio-scsi-device",
- virtio_scsi_device_create);
- qos_node_consumes("virtio-scsi-device", "virtio-bus", &opts);
-
-Will produce the following command line:
-``-drive id=drv0,if=none,file=null-co://, -device virtio-scsi-device,id=vs0 -device scsi-hd,bus=vs0.0,drive=drv0``
-
Qgraph API reference
-^^^^^^^^^^^^^^^^^^^^
+--------------------
.. kernel-doc:: tests/qtest/libqos/qgraph.h
diff --git a/docs/devel/qom-api.rst b/docs/devel/qom-api.rst
new file mode 100644
index 0000000000..ed1f17e797
--- /dev/null
+++ b/docs/devel/qom-api.rst
@@ -0,0 +1,9 @@
+.. _qom-api:
+
+=====================================
+QEMU Object Model (QOM) API Reference
+=====================================
+
+This is the complete API documentation for :ref:`qom`.
+
+.. kernel-doc:: include/qom/object.h
diff --git a/docs/devel/qom.rst b/docs/devel/qom.rst
index e5fe3597cd..0889ca949c 100644
--- a/docs/devel/qom.rst
+++ b/docs/devel/qom.rst
@@ -1,3 +1,5 @@
+.. _qom:
+
===========================
The QEMU Object Model (QOM)
===========================
@@ -11,6 +13,24 @@ features:
- System for dynamically registering types
- Support for single-inheritance of types
- Multiple inheritance of stateless interfaces
+- Mapping internal members to publicly exposed properties
+
+The root object class is TYPE_OBJECT, which provides the basic
+object methods.
+
+The QOM tree
+============
+
+The QOM tree is a composition tree which represents all of the objects
+that make up a QEMU "machine". You can view this tree by running
+``info qom-tree`` in the :ref:`QEMU monitor`. It will contain both
+objects created by the machine itself as well as those created due to
+user configuration.
+
+Creating a QOM class
+====================
+
+A simple minimal device implementation may look something like this:
.. code-block:: c
:caption: Creating a minimal type
@@ -24,7 +44,7 @@ features:
typedef DeviceClass MyDeviceClass;
typedef struct MyDevice
{
- DeviceState parent;
+ DeviceState parent_obj;
int reg0, reg1, reg2;
} MyDevice;
@@ -46,6 +66,12 @@ In the above example, we create a simple type that is described by #TypeInfo.
#TypeInfo describes information about the type including what it inherits
from, the instance and class size, and constructor/destructor hooks.
+The TYPE_DEVICE class is the parent class for all modern devices
+implemented in QEMU and adds some specific methods to handle the QEMU
+device model. This includes managing the lifetime of devices from
+creation through to when they become visible to the guest and
+eventually unrealized.
+
Alternatively several static types could be registered using helper macro
DEFINE_TYPES()
@@ -96,7 +122,7 @@ when the object is needed.
module_obj(TYPE_MY_DEVICE);
Class Initialization
-====================
+--------------------
Before an object is initialized, the class for the object must be
initialized. There is only one class object for all instance objects
@@ -145,7 +171,7 @@ will also have a wrapper function to call it easily:
typedef struct MyDeviceClass
{
- DeviceClass parent;
+ DeviceClass parent_class;
void (*frobnicate) (MyDevice *obj);
} MyDeviceClass;
@@ -166,7 +192,7 @@ will also have a wrapper function to call it easily:
}
Interfaces
-==========
+----------
Interfaces allow a limited form of multiple inheritance. Instances are
similar to normal types except for the fact that are only defined by
@@ -180,7 +206,7 @@ an argument to a method on its corresponding SomethingIfClass, or to
dynamically cast it to an object that implements the interface.
Methods
-=======
+-------
A *method* is a function within the namespace scope of
a class. It usually operates on the object instance by passing it as a
@@ -273,8 +299,8 @@ Alternatively, object_class_by_name() can be used to obtain the class and
its non-overridden methods for a specific type. This would correspond to
``MyClass::method(...)`` in C++.
-The first example of such a QOM method was #CPUClass.reset,
-another example is #DeviceClass.realize.
+One example of such methods is ``DeviceClass.reset``. More examples
+can be found at :ref:`device-life-cycle`.
Standard type declaration and definition macros
===============================================
@@ -292,8 +318,7 @@ in the header file:
.. code-block:: c
:caption: Declaring a simple type
- OBJECT_DECLARE_SIMPLE_TYPE(MyDevice, my_device,
- MY_DEVICE, DEVICE)
+ OBJECT_DECLARE_SIMPLE_TYPE(MyDevice, MY_DEVICE)
This is equivalent to the following:
@@ -323,12 +348,14 @@ used. This does the same as OBJECT_DECLARE_SIMPLE_TYPE(), but without
the 'struct MyDeviceClass' definition.
To implement the type, the OBJECT_DEFINE macro family is available.
-In the simple case the OBJECT_DEFINE_TYPE macro is suitable:
+For the simplest case of a leaf class which doesn't need any of its
+own virtual functions (i.e. which was declared with OBJECT_DECLARE_SIMPLE_TYPE)
+the OBJECT_DEFINE_SIMPLE_TYPE macro is suitable:
.. code-block:: c
:caption: Defining a simple type
- OBJECT_DEFINE_TYPE(MyDevice, my_device, MY_DEVICE, DEVICE)
+ OBJECT_DEFINE_SIMPLE_TYPE(MyDevice, my_device, MY_DEVICE, DEVICE)
This is equivalent to the following:
@@ -345,7 +372,6 @@ This is equivalent to the following:
.instance_size = sizeof(MyDevice),
.instance_init = my_device_init,
.instance_finalize = my_device_finalize,
- .class_size = sizeof(MyDeviceClass),
.class_init = my_device_class_init,
};
@@ -360,20 +386,43 @@ This is sufficient to get the type registered with the type
system, and the three standard methods now need to be implemented
along with any other logic required for the type.
+If the class needs its own virtual methods, or has some other
+per-class state it needs to store in its own class struct,
+then you can use the OBJECT_DEFINE_TYPE macro. This does the
+same thing as OBJECT_DEFINE_SIMPLE_TYPE, but it also sets the
+class_size of the type to the size of the class struct.
+
+.. code-block:: c
+ :caption: Defining a type which needs a class struct
+
+ OBJECT_DEFINE_TYPE(MyDevice, my_device, MY_DEVICE, DEVICE)
+
If the type needs to implement one or more interfaces, then the
-OBJECT_DEFINE_TYPE_WITH_INTERFACES() macro can be used instead.
-This accepts an array of interface type names.
+OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES() and
+OBJECT_DEFINE_TYPE_WITH_INTERFACES() macros can be used instead.
+These accept an array of interface type names. The difference between
+them is that the former is for simple leaf classes that don't need
+a class struct, and the latter is for when you will be defining
+a class struct.
.. code-block:: c
:caption: Defining a simple type implementing interfaces
+ OBJECT_DEFINE_SIMPLE_TYPE_WITH_INTERFACES(MyDevice, my_device,
+ MY_DEVICE, DEVICE,
+ { TYPE_USER_CREATABLE },
+ { NULL })
+
+.. code-block:: c
+ :caption: Defining a type implementing interfaces
+
OBJECT_DEFINE_TYPE_WITH_INTERFACES(MyDevice, my_device,
MY_DEVICE, DEVICE,
{ TYPE_USER_CREATABLE },
{ NULL })
-If the type is not intended to be instantiated, then then
-the OBJECT_DEFINE_ABSTRACT_TYPE() macro can be used instead:
+If the type is not intended to be instantiated, then the
+OBJECT_DEFINE_ABSTRACT_TYPE() macro can be used instead:
.. code-block:: c
:caption: Defining a simple abstract type
@@ -381,9 +430,32 @@ the OBJECT_DEFINE_ABSTRACT_TYPE() macro can be used instead:
OBJECT_DEFINE_ABSTRACT_TYPE(MyDevice, my_device,
MY_DEVICE, DEVICE)
+.. _device-life-cycle:
+
+Device Life-cycle
+=================
+
+As class initialisation cannot fail, devices have two additional
+methods to handle the creation of dynamic devices. The ``realize``
+function is called with an ``Error **`` pointer which should be set if
+the device cannot complete its setup. Otherwise, on successful
+completion of the ``realize`` method, the device object is added to the
+QOM tree and made visible to the guest.
+
+The reverse function is ``unrealize`` and should be where clean-up
+code lives to tidy up after the system is done with the device.
+
+All devices can be instantiated by C code, however only some can be
+created dynamically via the command line or monitor.
+Likewise only some can be unplugged after creation and need an
+explicit ``unrealize`` implementation. This is determined by the
+``user_creatable`` variable in the root ``DeviceClass`` structure.
+Devices can only be unplugged if their ``parent_bus`` has a registered
+``HotplugHandler``.
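
The realize/unrealize contract described above can be sketched with a small standalone model. This is only an illustration under stated assumptions: ``Error``, ``ToyDevice``, ``toy_device_realize`` and ``toy_device_unrealize`` are hypothetical stand-ins defined here, not QEMU's real types or functions, and the validity rule for ``num_queues`` is invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for QEMU's Error object; hypothetical, for illustration only. */
typedef struct Error {
    const char *msg;
} Error;

/* Hypothetical device with one property that must be validated. */
typedef struct ToyDevice {
    int num_queues;   /* configured before realize is called */
    bool realized;    /* true once the device is usable */
} ToyDevice;

/* realize may fail: report through errp and leave the device unrealized. */
static void toy_device_realize(ToyDevice *dev, Error **errp)
{
    static Error bad_queues = { "num_queues must be between 1 and 8" };

    if (dev->num_queues < 1 || dev->num_queues > 8) {
        if (errp) {
            *errp = &bad_queues;
        }
        return;
    }
    dev->realized = true;
}

/* unrealize is the reverse step: undo whatever realize set up. */
static void toy_device_unrealize(ToyDevice *dev)
{
    dev->realized = false;
}
```

The point of the sketch is the shape of the contract: configuration errors are reported from the realize step through the error pointer rather than from class initialisation, and the unrealize step only has to reverse what realize did.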
API Reference
--------------
+=============
-.. kernel-doc:: include/qom/object.h
+See the :ref:`QOM API<qom-api>` and :ref:`QDEV API<qdev-api>`
+documents for the complete API description.
diff --git a/docs/devel/qtest.rst b/docs/devel/qtest.rst
index c3dceb6c8a..c5b8546b3e 100644
--- a/docs/devel/qtest.rst
+++ b/docs/devel/qtest.rst
@@ -3,7 +3,6 @@ QTest Device Emulation Testing Framework
========================================
.. toctree::
- :hidden:
qgraph
@@ -82,11 +81,11 @@ which you can run manually.
QTest Protocol
--------------
-.. kernel-doc:: softmmu/qtest.c
+.. kernel-doc:: system/qtest.c
:doc: QTest Protocol
libqtest API reference
----------------------
-.. kernel-doc:: tests/qtest/libqos/libqtest.h
+.. kernel-doc:: tests/qtest/libqtest.h
diff --git a/docs/devel/replay.rst b/docs/devel/replay.rst
new file mode 100644
index 0000000000..effd856f0c
--- /dev/null
+++ b/docs/devel/replay.rst
@@ -0,0 +1,306 @@
+..
+ Copyright (c) 2022, ISP RAS
+ Written by Pavel Dovgalyuk and Alex Bennée
+
+=======================
+Execution Record/Replay
+=======================
+
+Core concepts
+=============
+
+Record/replay functions are used for the deterministic replay of qemu
+execution. Execution recording writes a non-deterministic events log, which
+can be later used for replaying the execution anywhere and for unlimited
+number of times. Execution replaying reads the log and replays all
+non-deterministic events including external input, hardware clocks,
+and interrupts.
+
+Several parts of QEMU include function calls that record events to the
+log and replay them.
+Device models that have non-deterministic input from external devices were
+changed to write every external event into the execution log immediately.
+E.g. network packets are written into the log when they arrive at the virtual
+network adapter.
+
+All non-deterministic events come from these devices. But to
+replay them we need to know at which moments they occur. We specify
+these moments by counting the number of instructions executed between
+every pair of consecutive events.
+
+Academic papers with description of deterministic replay implementation:
+
+* `Deterministic Replay of System's Execution with Multi-target QEMU Simulator for Dynamic Analysis and Reverse Debugging <https://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html>`_
+* `Don't panic: reverse debugging of kernel drivers <https://dl.acm.org/citation.cfm?id=2786805.2803179>`_
+
+Modifications of qemu include:
+
+ * wrappers for clock and time functions to save their return values in the log
+ * saving different asynchronous events (e.g. system shutdown) into the log
+ * synchronization of the bottom halves execution
+ * synchronization of the threads from thread pool
+ * recording/replaying user input (mouse, keyboard, and microphone)
+ * adding internal checkpoints for cpu and io synchronization
+ * network filter for recording and replaying the packets
+ * block driver for making block layer deterministic
+ * serial port input record and replay
+ * recording of random numbers obtained from the external sources
+
+Instruction counting
+--------------------
+
+QEMU should work in icount mode to use record/replay feature. icount was
+designed to allow deterministic execution in absence of external inputs
+of the virtual machine. We also use icount to control the occurrence of the
+non-deterministic events. The number of instructions elapsed from the last event
+is written to the log while recording the execution. In replay mode we
+can predict when to inject that event using the instruction counter.
+
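As a rough illustration of the idea above, the following standalone sketch replays a list of events by counting instructions. It is a simplified model, not QEMU code: ``LoggedEvent`` and ``replay_events`` are hypothetical names, and the "guest" is just a counter that advances one instruction at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One logged event: how many instructions ran since the previous event. */
typedef struct LoggedEvent {
    uint32_t insns_since_last;
    int event_id;
} LoggedEvent;

/*
 * Replay a log against a run of total_insns instructions.
 * inject_at[i] receives the absolute instruction count at which event i
 * is injected. Returns the number of events injected before the run ended.
 */
static size_t replay_events(const LoggedEvent *log, size_t n_events,
                            uint64_t total_insns, uint64_t *inject_at)
{
    uint64_t icount = 0;
    size_t next = 0;
    uint64_t deadline = n_events ? log[0].insns_since_last : 0;

    for (;;) {
        /* Inject every event that is due at the current instruction count. */
        while (next < n_events && icount == deadline) {
            inject_at[next] = icount;
            next++;
            if (next < n_events) {
                deadline = icount + log[next].insns_since_last;
            }
        }
        if (icount == total_insns) {
            break;
        }
        icount++; /* "execute" one guest instruction */
    }
    return next;
}
```

Because the log stores deltas rather than absolute counts, two events recorded back to back (a delta of zero) are injected at the same instruction boundary, in log order.
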
+Locking and thread synchronisation
+----------------------------------
+
+Previously the synchronisation of the main thread and the vCPU thread
+was ensured by the holding of the BQL. However the trend has been to
+reduce the time the BQL was held across the system including under TCG
+system emulation. As it is important that batches of events are kept
+in sequence (e.g. expiring timers and checkpoints in the main thread
+while instruction checkpoints are written by the vCPU thread) we need
+another lock to keep things in lock-step. This role is now handled by
+the replay_mutex_lock. It used to be held only for each event being
+written but now it is held for a whole execution period. This results
+in a deterministic ping-pong between the two main threads.
+
+As the BQL is now a finer grained lock than the replay_lock it is almost
+certainly a bug, and a source of deadlocks, to take the
+replay_mutex_lock while the BQL is held. This is enforced by an assert.
+While the unlocks are usually in the reverse order, this is not
+necessary; you can drop the replay_lock while holding the BQL, without
+doing a more complicated unlock_iothread/replay_unlock/lock_iothread
+sequence.
+
+Checkpoints
+-----------
+
+Replaying the execution of a virtual machine is bound by sources of
+non-determinism. These are inputs from clock and peripheral devices,
+and QEMU thread scheduling. Thread scheduling affects the processing of
+events from timers, asynchronous input-output, and bottom halves.
+
+Invocations of timers are coupled with clock reads and changing the state
+of the virtual machine. Reads produce non-deterministic data taken from
+host clock. And VM state changes should preserve their order. Their relative
+order in replay mode must replicate the order of callbacks in record mode.
+To preserve this order we use checkpoints. When a specific clock is processed
+in record mode we save a special "checkpoint" event to the log.
+Checkpoints here do not refer to virtual machine snapshots. They are just
+record/replay events used for synchronization.
+
+QEMU in replay mode may try to invoke timer processing at an arbitrary
+moment. That is why we do not process a group of timers until the checkpoint
+event is read from the log. Such an event allows synchronizing CPU
+execution and timer events.
+
+Two other checkpoints govern the "warping" of the virtual clock.
+While the virtual machine is idle, the virtual clock increments at
+1 ns per *real time* nanosecond. This is done by setting up a timer
+(called the warp timer) on the virtual real time clock, so that the
+timer fires at the next deadline of the virtual clock; the virtual clock
+is then incremented (which is called "warping" the virtual clock) as
+soon as the timer fires or the CPUs need to go out of the idle state.
+Two functions are used for this purpose; because these actions change
+virtual machine state and must be deterministic, each of them creates a
+checkpoint. ``icount_start_warp_timer`` checks if the CPUs are idle and if so
+starts accounting real time to virtual clock. ``icount_account_warp_timer``
+is called when the CPUs get an interrupt or when the warp timer fires,
+and it warps the virtual clock by the amount of real time that has passed
+since ``icount_start_warp_timer``.
+
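The warp accounting described above can be modelled in a few lines. This is a simplified sketch only: ``WarpClock``, ``warp_start`` and ``warp_account`` are hypothetical names that mirror the roles of ``icount_start_warp_timer`` and ``icount_account_warp_timer``, and the model ignores deadlines, checkpoints and locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal model of a virtual clock that can be warped while CPUs idle. */
typedef struct WarpClock {
    int64_t virtual_ns;     /* current virtual clock value */
    int64_t warp_start_ns;  /* real time when warping began, -1 if inactive */
} WarpClock;

/* Models the "start" step: begin accounting real time if the CPUs are idle. */
static void warp_start(WarpClock *c, bool cpus_idle, int64_t real_now_ns)
{
    if (cpus_idle && c->warp_start_ns < 0) {
        c->warp_start_ns = real_now_ns;
    }
}

/* Models the "account" step: add the elapsed real time to the virtual clock. */
static void warp_account(WarpClock *c, int64_t real_now_ns)
{
    if (c->warp_start_ns < 0) {
        return; /* no warp in progress, nothing to account */
    }
    c->virtual_ns += real_now_ns - c->warp_start_ns;
    c->warp_start_ns = -1;
}
```

In this model the amount added to the virtual clock depends on host real time, which is why both steps need to be tied to checkpoints in the log when recording and replaying.
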
+Virtual devices
+===============
+
+The record/replay mechanism, which can be enabled through icount mode,
+expects the virtual devices to satisfy the following requirement:
+everything that affects
+the guest state during execution in icount mode should be deterministic.
+
+Timers
+------
+
+Timers are used to execute callbacks from different subsystems of QEMU
+at the specified moments of time. There are several kinds of timers:
+
+ * Real time clock. Based on host time and used only for callbacks that
+ do not change the virtual machine state. For this reason real time
+ clock and timers do not affect deterministic replay at all.
+ * Virtual clock. These timers run only during the emulation. In icount
+ mode virtual clock value is calculated using executed instructions counter.
+ That is why it is completely deterministic and does not have to be recorded.
+ * Host clock. This clock is used by device models that simulate real time
+ sources (e.g. real time clock chip). Host clock is one of the sources
+ of non-determinism. Host clock read operations should be logged to
+ make the execution deterministic.
+ * Virtual real time clock. This clock is similar to real time clock but
+ it is used only for increasing virtual clock while virtual machine is
+ sleeping. Like the host clock, it is non-deterministic
+ and has to be logged too.
+
+All virtual devices should use virtual clock for timers that change the guest
+state. Virtual clock is deterministic, therefore such timers are deterministic
+too.
+
+Virtual devices can also use realtime clock for the events that do not change
+the guest state directly. When the clock ticking should depend on VM execution
+speed, use virtual clock with EXTERNAL attribute. It is not deterministic,
+but its speed depends on the guest execution. This clock is used by
+the virtual devices (e.g., slirp routing device) that lie outside the
+replayed guest.
+
+Block devices
+-------------
+
+Block devices record/replay module (``blkreplay``) intercepts calls of
+bdrv coroutine functions at the top of block drivers stack.
+
+All block completion operations are added to the queue in the coroutines.
+When the queue is flushed the information about processed requests
+is recorded to the log. In replay phase the queue is matched with
+events read from the log. Therefore block devices requests are processed
+deterministically.
+
+Bottom halves
+-------------
+
+Bottom half callbacks, that affect the guest state, should be invoked through
+``replay_bh_schedule_event`` or ``replay_bh_schedule_oneshot_event`` functions.
+Their invocations are saved in record mode and synchronized with the existing
+log in replay mode.
+
+Disk I/O events are completely deterministic in our model, because
+in both record and replay modes we start virtual machine from the same
+disk state. But callbacks that virtual disk controller uses for reading and
+writing the disk may occur at different moments of time in record and replay
+modes.
+
+Reading and writing requests are created by CPU thread of QEMU. Later these
+requests proceed to block layer which creates "bottom halves". Bottom
+halves consist of callback and its parameters. They are processed when
+main loop locks the BQL. These locks are not synchronized with
+replaying process because main loop also processes the events that do not
+affect the virtual machine state (like user interaction with monitor).
+
+That is why we had to implement saving and replaying bottom halves callbacks
+synchronously to the CPU execution. When the callback is about to execute
+it is added to the queue in the replay module. This queue is written to the
+log when its callbacks are executed. In replay mode callbacks are not processed
+until the corresponding event is read from the events log file.
+
+Sometimes the block layer uses asynchronous callbacks for its internal purposes
+(like reading or writing VM snapshots or disk image cluster tables). In this
+case bottom halves are not marked as "replayable" and are not saved
+into the log.
+
+Saving/restoring the VM state
+-----------------------------
+
+All fields in the device state structure (including virtual timers)
+should be restored by loadvm to the same values they had before savevm.
+
+Avoid accessing other devices' state, because the order of saving/restoring
+is not defined. It means that you should not call functions like
+``update_irq`` in ``post_load`` callback. Save everything explicitly to avoid
+the dependencies that may make restoring the VM state non-deterministic.
+
+Stopping the VM
+---------------
+
+Stopping the guest should not interfere with its state (with the exception
+of the network connections, that could be broken by the remote timeouts).
+VM can be stopped at any moment of replay by the user. Restarting the VM
+after that stop should not break the replay by the unneeded guest state change.
+
+Replay log format
+=================
+
+Record/replay log consists of the header and the sequence of execution
+events. The header includes 4-byte replay version id and 8-byte reserved
+field. Version is updated every time replay log format changes to prevent
+using replay log created by another build of qemu.
+
+The sequence of the events describes virtual machine state changes.
+It includes all non-deterministic inputs of VM, synchronization marks and
+instruction counts used to correctly inject inputs at replay.
+
+Synchronization marks (checkpoints) are used for synchronizing qemu threads
+that perform operations with virtual hardware. These operations may change
+system's state (e.g., change some register or generate interrupt) and
+therefore should execute synchronously with CPU thread.
+
+Every event in the log includes 1-byte event id and optional arguments.
+When argument is an array, it is stored as 4-byte array length
+and corresponding number of bytes with data.
+Here is the list of events that are written into the log:
+
+ - EVENT_INSTRUCTION. Instructions executed since last event. Followed by:
+
+ - 4-byte number of executed instructions.
+
+ - EVENT_INTERRUPT. Used to synchronize interrupt processing.
+ - EVENT_EXCEPTION. Used to synchronize exception handling.
+ - EVENT_ASYNC. This is a group of events. When such an event is generated,
+ it is stored in the queue and processed in icount_account_warp_timer().
+ Every such event has its own id from the following list:
+
+ - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
+ callbacks that affect virtual machine state, but normally called
+ asynchronously. Followed by:
+
+ - 8-byte operation id.
+
+ - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
+ parameters of keyboard and mouse input operations
+ (key press/release, mouse pointer movement). Followed by:
+
+ - 9-16 bytes depending on the input event.
+
+ - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
+ - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
+ initiated by the sender. Followed by:
+
+ - 1-byte character device id.
+ - Array with bytes that were read.
+
+ - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
+ operations with disk and flash drives with CPU. Followed by:
+
+ - 8-byte operation id.
+
+ - REPLAY_ASYNC_EVENT_NET. Incoming network packet. Followed by:
+
+ - 1-byte network adapter id.
+ - 4-byte packet flags.
+ - Array with packet bytes.
+
+ - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu,
+ e.g., by closing the window.
+ - EVENT_CHAR_WRITE. Used to synchronize character output operations. Followed by:
+
+ - 4-byte output function return value.
+ - 4-byte offset in the output array.
+
+ - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
+ initiated by qemu. Followed by:
+
+ - Array with bytes that were read.
+
+ - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
+ initiated by qemu. Followed by:
+
+ - 4-byte error code.
+
+ - EVENT_CLOCK + clock_id. Group of events for host clock read operations. Followed by:
+
+ - 8-byte clock value.
+
+ - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
+ CPU, internal threads, and asynchronous input events.
+ - EVENT_END. Last event in the log.
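
The framing rules above (a 1-byte event id, and arrays stored as a 4-byte length followed by the data) can be sketched as a small encoder for an event whose only argument is an array, such as EVENT_CHAR_READ_ALL. This is an illustration under assumptions: the helper names are hypothetical, the numeric event id used by a caller is arbitrary, and the big-endian byte order of the length field is an assumption of this sketch, since the text above does not state it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit value most significant byte first (assumed byte order). */
static void put_u32_be(uint8_t *out, uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/*
 * Encode one event whose only argument is a byte array:
 *   1 byte   event id
 *   4 bytes  array length
 *   N bytes  array data
 * Returns the number of bytes written, or 0 if the buffer is too small.
 */
static size_t encode_array_event(uint8_t *out, size_t cap, uint8_t event_id,
                                 const uint8_t *data, uint32_t len)
{
    size_t need = 1 + 4 + (size_t)len;

    if (cap < need) {
        return 0;
    }
    out[0] = event_id;
    put_u32_be(out + 1, len);
    if (len > 0) {
        memcpy(out + 5, data, len);
    }
    return need;
}
```

A two-byte array therefore occupies seven bytes in this model: one for the id, four for the length and two for the payload.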
diff --git a/docs/devel/replay.txt b/docs/devel/replay.txt
deleted file mode 100644
index e641c35add..0000000000
--- a/docs/devel/replay.txt
+++ /dev/null
@@ -1,46 +0,0 @@
-Record/replay mechanism, that could be enabled through icount mode, expects
-the virtual devices to satisfy the following requirements.
-
-The main idea behind this document is that everything that affects
-the guest state during execution in icount mode should be deterministic.
-
-Timers
-======
-
-All virtual devices should use virtual clock for timers that change the guest
-state. Virtual clock is deterministic, therefore such timers are deterministic
-too.
-
-Virtual devices can also use realtime clock for the events that do not change
-the guest state directly. When the clock ticking should depend on VM execution
-speed, use virtual clock with EXTERNAL attribute. It is not deterministic,
-but its speed depends on the guest execution. This clock is used by
-the virtual devices (e.g., slirp routing device) that lie outside the
-replayed guest.
-
-Bottom halves
-=============
-
-Bottom half callbacks, that affect the guest state, should be invoked through
-replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
-Their invocations are saved in record mode and synchronized with the existing
-log in replay mode.
-
-Saving/restoring the VM state
-=============================
-
-All fields in the device state structure (including virtual timers)
-should be restored by loadvm to the same values they had before savevm.
-
-Avoid accessing other devices' state, because the order of saving/restoring
-is not defined. It means that you should not call functions like
-'update_irq' in post_load callback. Save everything explicitly to avoid
-the dependencies that may make restoring the VM state non-deterministic.
-
-Stopping the VM
-===============
-
-Stopping the guest should not interfere with its state (with the exception
-of the network connections, that could be broken by the remote timeouts).
-VM can be stopped at any moment of replay by the user. Restarting the VM
-after that stop should not break the replay by the unneeded guest state change.
diff --git a/docs/devel/reset.rst b/docs/devel/reset.rst
index abea1102dc..9746a4e8a0 100644
--- a/docs/devel/reset.rst
+++ b/docs/devel/reset.rst
@@ -11,15 +11,15 @@ whole group can be reset consistently. Each individual member object does not
have to care about others; in particular, problems of order (which object is
reset first) are addressed.
-As of now DeviceClass and BusClass implement this interface.
-
+The main object types which implement this interface are DeviceClass
+and BusClass.
Triggering reset
----------------
This section documents the APIs which "users" of a resettable object should use
to control it. All resettable control functions must be called while holding
-the iothread lock.
+the BQL.
You can apply a reset to an object using ``resettable_assert_reset()``. You need
to call ``resettable_release_reset()`` to release the object from reset. To
@@ -27,9 +27,7 @@ instantly reset an object, without keeping it in reset state, just call
``resettable_reset()``. These functions take two parameters: a pointer to the
object to reset and a reset type.
-Several types of reset will be supported. For now only cold reset is defined;
-others may be added later. The Resettable interface handles reset types with an
-enum:
+The Resettable interface handles reset types with an enum ``ResetType``:
``RESET_TYPE_COLD``
Cold reset is supported by every resettable object. In QEMU, it means we reset
@@ -37,6 +35,19 @@ enum:
from what is a real hardware cold reset. It differs from other resets (like
warm or bus resets) which may keep certain parts untouched.
+``RESET_TYPE_SNAPSHOT_LOAD``
+ This is called for a reset which is being done to put the system into a
+ clean state prior to loading a snapshot. (This corresponds to a reset
+ with ``SHUTDOWN_CAUSE_SNAPSHOT_LOAD``.) Almost all devices should treat
+ this the same as ``RESET_TYPE_COLD``. The main exception is devices which
+ have some non-deterministic state they want to reinitialize to a different
+ value on each cold reset, such as RNG seed information, and which they
+ must not reinitialize on a snapshot-load reset.
+
+Devices which implement reset methods must treat any unknown ``ResetType``
+as equivalent to ``RESET_TYPE_COLD``; this will reduce the amount of
+existing code we need to change if we add more types in future.
+
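The rule above can be illustrated with a standalone model of a device that holds RNG seed state. This is a sketch only: ``ToyResetType``, ``ToyRng`` and ``toy_rng_reset_hold`` are hypothetical stand-ins defined here (they are not the QEMU ``ResetType`` enum or a real device), and the fresh seed is passed in as a parameter to keep the example deterministic.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the reset type enum; not the QEMU definition. */
typedef enum ToyResetType {
    TOY_RESET_TYPE_COLD,
    TOY_RESET_TYPE_SNAPSHOT_LOAD,
} ToyResetType;

/* Hypothetical device: one ordinary register plus non-deterministic state. */
typedef struct ToyRng {
    uint32_t seed;  /* reinitialized on cold reset only */
    uint32_t reg;   /* cleared on every reset */
} ToyRng;

static void toy_rng_reset_hold(ToyRng *s, ToyResetType type,
                               uint32_t fresh_seed)
{
    s->reg = 0;

    switch (type) {
    case TOY_RESET_TYPE_SNAPSHOT_LOAD:
        /* Keep the seed: the snapshot being loaded will provide it. */
        break;
    case TOY_RESET_TYPE_COLD:
    default:
        /* Unknown reset types are treated like a cold reset. */
        s->seed = fresh_seed;
        break;
    }
}
```

Routing unknown values through the ``default`` label together with the cold case means a reset type added later needs no change to this handler unless the device wants special behaviour for it.
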
Calling ``resettable_reset()`` is equivalent to calling
``resettable_assert_reset()`` then ``resettable_release_reset()``. It is
possible to interleave multiple calls to these three functions. There may
@@ -150,25 +161,25 @@ in reset.
mydev->var = 0;
}
- static void mydev_reset_hold(Object *obj)
+ static void mydev_reset_hold(Object *obj, ResetType type)
{
MyDevClass *myclass = MYDEV_GET_CLASS(obj);
MyDevState *mydev = MYDEV(obj);
/* call parent class hold phase */
if (myclass->parent_phases.hold) {
- myclass->parent_phases.hold(obj);
+ myclass->parent_phases.hold(obj, type);
}
/* set an IO */
qemu_set_irq(mydev->irq, 1);
}
- static void mydev_reset_exit(Object *obj)
+ static void mydev_reset_exit(Object *obj, ResetType type)
{
MyDevClass *myclass = MYDEV_GET_CLASS(obj);
MyDevState *mydev = MYDEV(obj);
/* call parent class exit phase */
if (myclass->parent_phases.exit) {
- myclass->parent_phases.exit(obj);
+ myclass->parent_phases.exit(obj, type);
}
/* clear an IO */
qemu_set_irq(mydev->irq, 0);
@@ -184,21 +195,20 @@ in reset.
{
MyDevClass *myclass = MYDEV_CLASS(class);
ResettableClass *rc = RESETTABLE_CLASS(class);
- resettable_class_set_parent_reset_phases(rc,
- mydev_reset_enter,
- mydev_reset_hold,
- mydev_reset_exit,
- &myclass->parent_phases);
+ resettable_class_set_parent_phases(rc,
+ mydev_reset_enter,
+ mydev_reset_hold,
+ mydev_reset_exit,
+ &myclass->parent_phases);
}
In the above example, we override all three phases. It is possible to override
only some of them by passing NULL instead of a function pointer to
-``resettable_class_set_parent_reset_phases()``. For example, the following will
+``resettable_class_set_parent_phases()``. For example, the following will
only override the *enter* phase and leave *hold* and *exit* untouched::
- resettable_class_set_parent_reset_phases(rc, mydev_reset_enter,
- NULL, NULL,
- &myclass->parent_phases);
+ resettable_class_set_parent_phases(rc, mydev_reset_enter, NULL, NULL,
+ &myclass->parent_phases);
This is equivalent to providing a trivial implementation of the hold and exit
phases which does nothing but call the parent class's implementation of the
@@ -210,9 +220,11 @@ Polling the reset state
Resettable interface provides the ``resettable_is_in_reset()`` function.
This function returns true if the object parameter is currently under reset.
-An object is under reset from the beginning of the *init* phase to the end of
-the *exit* phase. During all three phases, the function will return that the
-object is in reset.
+An object is under reset from the beginning of the *enter* phase (before
+either its children or its own enter method is called) to the *exit*
+phase. During the *enter* and *hold* phases only, the function will return that the
+object is in reset. The state is changed after the *exit* is propagated to
+its children and just before calling the object's own *exit* method.
This function may be used if the object behavior has to be adapted
while in reset state. For example if a device has an irq input,
@@ -287,3 +299,43 @@ There is currently 2 cases where this function is used:
2. *hot bus change*; it means an existing live device is added, moved or
removed in the bus hierarchy. At the moment, it occurs only in the raspi
machines for changing the sdbus used by sd card.
+
+Reset of the complete system
+----------------------------
+
+Reset of the complete system is a little complicated. The typical
+flow is:
+
+1. Code which wishes to reset the entire system does so by calling
+ ``qemu_system_reset_request()``. This schedules a reset, but the
+ reset will happen asynchronously after the function returns.
+ That makes this safe to call from, for example, device models.
+
+2. The function which is called to make the reset happen is
+ ``qemu_system_reset()``. Generally only core system code should
+ call this directly.
+
+3. ``qemu_system_reset()`` calls the ``MachineClass::reset`` method of
+ the current machine, if it has one. That method must call
+ ``qemu_devices_reset()``. If the machine has no reset method,
+ ``qemu_system_reset()`` calls ``qemu_devices_reset()`` directly.
+
+4. ``qemu_devices_reset()`` performs a reset of the system, using
+ the three-phase mechanism listed above. It resets all objects
+ that were registered with it using ``qemu_register_resettable()``.
+ It also calls all the functions registered with it using
+ ``qemu_register_reset()``. Those functions are called during the
+ "hold" phase of this reset.
+
+5. The most important object that this reset resets is the
+ 'sysbus' bus. The sysbus bus is the root of the qbus tree. This
+ means that all devices on the sysbus are reset, and all their
+ child buses, and all the devices on those child buses.
+
+6. Devices which are not on the qbus tree are *not* automatically
+ reset! (The most obvious example of this is CPU objects, but
+ anything that directly inherits from ``TYPE_OBJECT`` or ``TYPE_DEVICE``
+ rather than from ``TYPE_SYS_BUS_DEVICE`` or some other plugs-into-a-bus
+ type will be in this category.) You need to therefore arrange for these
+ to be reset in some other way (e.g. using ``qemu_register_resettable()``
+ or ``qemu_register_reset()``).
diff --git a/docs/devel/s390-cpu-topology.rst b/docs/devel/s390-cpu-topology.rst
new file mode 100644
index 0000000000..48313b92d4
--- /dev/null
+++ b/docs/devel/s390-cpu-topology.rst
@@ -0,0 +1,170 @@
+QAPI interface for S390 CPU topology
+====================================
+
+The following sections will explain the QAPI interface for S390 CPU topology
+with the help of exemplary output.
+For this, let's assume that QEMU has been started with the following
+command, defining 4 CPUs, where CPU[0] is defined by the -smp argument and will
+have default values:
+
+.. code-block:: bash
+
+ qemu-system-s390x \
+ -enable-kvm \
+ -cpu z14,ctop=on \
+ -smp 1,drawers=3,books=3,sockets=2,cores=2,maxcpus=36 \
+ -device z14-s390x-cpu,core-id=19,entitlement=high \
+ -device z14-s390x-cpu,core-id=11,entitlement=low \
+ -device z14-s390x-cpu,core-id=12,entitlement=high \
+ ...
+
+Additions to query-cpus-fast
+----------------------------
+
+The command query-cpus-fast allows querying the topology tree and
+modifiers for all configured vCPUs.
+
+.. code-block:: QMP
+
+ { "execute": "query-cpus-fast" }
+ {
+ "return": [
+ {
+ "dedicated": false,
+ "thread-id": 536993,
+ "props": {
+ "core-id": 0,
+ "socket-id": 0,
+ "drawer-id": 0,
+ "book-id": 0
+ },
+ "cpu-state": "operating",
+ "entitlement": "medium",
+ "qom-path": "/machine/unattached/device[0]",
+ "cpu-index": 0,
+ "target": "s390x"
+ },
+ {
+ "dedicated": false,
+ "thread-id": 537003,
+ "props": {
+ "core-id": 19,
+ "socket-id": 1,
+ "drawer-id": 0,
+ "book-id": 2
+ },
+ "cpu-state": "operating",
+ "entitlement": "high",
+ "qom-path": "/machine/peripheral-anon/device[0]",
+ "cpu-index": 19,
+ "target": "s390x"
+ },
+ {
+ "dedicated": false,
+ "thread-id": 537004,
+ "props": {
+ "core-id": 11,
+ "socket-id": 1,
+ "drawer-id": 0,
+ "book-id": 1
+ },
+ "cpu-state": "operating",
+ "entitlement": "low",
+ "qom-path": "/machine/peripheral-anon/device[1]",
+ "cpu-index": 11,
+ "target": "s390x"
+ },
+ {
+ "dedicated": true,
+ "thread-id": 537005,
+ "props": {
+ "core-id": 12,
+ "socket-id": 0,
+ "drawer-id": 3,
+ "book-id": 2
+ },
+ "cpu-state": "operating",
+ "entitlement": "high",
+ "qom-path": "/machine/peripheral-anon/device[2]",
+ "cpu-index": 12,
+ "target": "s390x"
+ }
+ ]
+ }
+
+
+QAPI command: set-cpu-topology
+------------------------------
+
+The command set-cpu-topology allows modifying the topology tree
+or the topology modifiers of a vCPU in the configuration.
+
+.. code-block:: QMP
+
+ { "execute": "set-cpu-topology",
+ "arguments": {
+ "core-id": 11,
+ "socket-id": 0,
+ "book-id": 0,
+ "drawer-id": 0,
+ "entitlement": "low",
+ "dedicated": false
+ }
+ }
+ {"return": {}}
+
+The core-id parameter is the only mandatory parameter and every
+unspecified parameter keeps its previous value.
+
+QAPI event CPU_POLARIZATION_CHANGE
+----------------------------------
+
+When a guest requests a modification of the polarization,
+QEMU sends a CPU_POLARIZATION_CHANGE event.
+
+When requesting the change, the guest only specifies horizontal or
+vertical polarization.
+It is the job of the entity administrating QEMU to set the dedication and
+fine-grained vertical entitlement in response to this event.
+
+Note that a vertically polarized dedicated vCPU can only have a high
+entitlement, giving 6 possibilities for vCPU polarization:
+
+- Horizontal
+- Horizontal dedicated
+- Vertical low
+- Vertical medium
+- Vertical high
+- Vertical high dedicated
+
+Example of the event received when the guest issues the CPU instruction
+Perform Topology Function PTF(0) to request a horizontal polarization:
+
+.. code-block:: QMP
+
+ {
+ "timestamp": {
+ "seconds": 1687870305,
+ "microseconds": 566299
+ },
+ "event": "CPU_POLARIZATION_CHANGE",
+ "data": {
+ "polarization": "horizontal"
+ }
+ }
+
+QAPI query command: query-s390x-cpu-polarization
+------------------------------------------------
+
+The query command query-s390x-cpu-polarization returns the current
+CPU polarization of the machine.
+In this case the guest previously issued a PTF(1) to request vertical polarization:
+
+.. code-block:: QMP
+
+ { "execute": "query-s390x-cpu-polarization" }
+ {
+ "return": {
+ "polarization": "vertical"
+ }
+ }
diff --git a/docs/devel/stable-process.rst b/docs/devel/stable-process.rst
index e541b983fa..c21fb86645 100644
--- a/docs/devel/stable-process.rst
+++ b/docs/devel/stable-process.rst
@@ -1,3 +1,5 @@
+.. _stable-process:
+
QEMU and the stable process
===========================
diff --git a/docs/devel/style.rst b/docs/devel/style.rst
index 260e3263fa..2f68b50079 100644
--- a/docs/devel/style.rst
+++ b/docs/devel/style.rst
@@ -1,3 +1,5 @@
+.. _coding-style:
+
=================
QEMU Coding Style
=================
@@ -10,6 +12,10 @@ patches before submitting.
Formatting and style
********************
+The repository includes a ``.editorconfig`` file which can help with
+getting the right settings for your preferred $EDITOR. See
+`<https://editorconfig.org/>`_ for details.
+
Whitespace
==========
@@ -149,6 +155,12 @@ If there are two versions of a function to be called with or without a
lock held, the function that expects the lock to be already held
usually uses the suffix ``_locked``.
+If a function is a shim designed to deal with compatibility
+workarounds we use the suffix ``_compat``. These are generally not
+called directly and are aliased to the plain function name via the
+pre-processor. Another common suffix is ``_impl``; it is used for the
+concrete implementation of a function that will not be called
+directly, but rather through a macro or an inline function.
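
As a rough illustration of the ``_impl`` convention, the sketch below hides a worker function behind a macro that supplies the call site. The names ``checked_div`` and ``checked_div_impl`` are hypothetical and do not exist in QEMU.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical example: the _impl function takes extra arguments that
 * callers should not have to spell out themselves.
 */
static int checked_div_impl(int a, int b, const char *file, int line)
{
    assert(file != NULL);   /* always supplied by the macro below */
    if (b == 0) {
        fprintf(stderr, "%s:%d: division by zero\n", file, line);
        return 0;
    }
    return a / b;
}

/* Callers use the plain name; the macro fills in the call site. */
#define checked_div(a, b) checked_div_impl((a), (b), __FILE__, __LINE__)
```

Callers write ``checked_div(a, b)`` and never name the ``_impl`` function themselves.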
Block structure
===============
@@ -192,7 +204,14 @@ Declarations
Mixed declarations (interleaving statements and declarations within
blocks) are generally not allowed; declarations should be at the beginning
-of blocks.
+of blocks. To avoid accidental re-use it is permissible to declare
+loop variables inside for loops:
+
+.. code-block:: c
+
+ for (int i = 0; i < ARRAY_SIZE(thing); i++) {
+ /* do something loopy */
+ }
Every now and then, an exception is made for declarations inside a
#ifdef or #ifndef block: if the code looks nicer, such declarations can
@@ -281,6 +300,27 @@ that QEMU depends on.
Do not include "qemu/osdep.h" from header files since the .c file will have
already included it.
+Headers should normally include everything they need beyond osdep.h.
+If exceptions are needed for some reason, they must be documented in
+the header. If all that's needed from a header is typedefs, consider
+putting those into qemu/typedefs.h instead of including the header.
+
+Cyclic inclusion is forbidden.
+
+Generative Includes
+-------------------
+
+QEMU makes fairly extensive use of the macro pre-processor to
+instantiate multiple similar functions. While such abuse of the macro
+processor isn't discouraged, it can make debugging and code navigation
+harder. You should consider carefully whether the same effect can be
+achieved by making it easy for the compiler to constant fold or by using
+Python scripting to generate grep-friendly code.
+
+If you do use template header files they should be named with the
+``.c.inc`` or ``.h.inc`` suffix to make it clear they are being
+included for expansion.
+
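One way to read the constant-folding suggestion above is sketched here: a single generic helper plus thin inline wrappers, instead of one macro-generated function per size. The ``load_le*`` names are hypothetical, chosen for illustration, and are not QEMU's own accessors.

```c
#include <assert.h>
#include <stdint.h>

/*
 * One generic helper. Each wrapper passes a constant size, so an
 * optimizing compiler can typically fold the loop away per call site.
 */
static inline uint64_t load_le(const void *p, unsigned size)
{
    const uint8_t *b = p;
    uint64_t v = 0;

    assert(size <= sizeof(uint64_t));
    for (unsigned i = 0; i < size; i++) {
        v |= (uint64_t)b[i] << (8 * i);
    }
    return v;
}

/* Thin, grep-friendly wrappers instead of macro-instantiated copies. */
static inline uint16_t load_le16(const void *p)
{
    return (uint16_t)load_le(p, 2);
}

static inline uint32_t load_le32(const void *p)
{
    return (uint32_t)load_le(p, 4);
}
```

Every function here has a definition that a plain text search will find, which is the navigation benefit the paragraph above is after.
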
C types
=======
@@ -481,11 +521,11 @@ of arguments.
C standard, implementation defined and undefined behaviors
==========================================================
-C code in QEMU should be written to the C99 language specification. A copy
-of the final version of the C99 standard with corrigenda TC1, TC2, and TC3
-included, formatted as a draft, can be downloaded from:
+C code in QEMU should be written to the C11 language specification. A
+copy of the final version of the C11 standard, formatted as a draft,
+can be downloaded from:
- `<http://www.open-std.org/jtc1/sc22/WG14/www/docs/n1256.pdf>`_
+ `<http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf>`_
The C language specification defines regions of undefined behavior and
implementation defined behavior (to give compiler authors enough leeway to
@@ -510,7 +550,7 @@ documented in the GNU Compiler Collection manual starting at version 4.0.
Automatic memory deallocation
=============================
-QEMU has a mandatory dependency either the GCC or CLang compiler. As
+QEMU has a mandatory dependency on either the GCC or the Clang compiler. As
such it has the freedom to make use of a C language extension for
automatically running a cleanup function when a stack variable goes
out of scope. This can be used to simplify function cleanup paths,
@@ -534,7 +574,8 @@ For example, instead of
.. code-block:: c
- int somefunc(void) {
+ int somefunc(void)
+ {
int ret = -1;
 char *foo = g_strdup_printf("foo%s", "wibble");
GList *bar = .....
@@ -555,7 +596,8 @@ Using g_autofree/g_autoptr enables the code to be written as:
.. code-block:: c
- int somefunc(void) {
+ int somefunc(void)
+ {
 g_autofree char *foo = g_strdup_printf("foo%s", "wibble");
g_autoptr (GList) bar = .....
@@ -580,7 +622,8 @@ are still some caveats to beware of
.. code-block:: c
- char *somefunc(void) {
+ char *somefunc(void)
+ {
 g_autofree char *foo = g_strdup_printf("foo%s", "wibble");
g_autoptr (GList) bar = .....
@@ -595,6 +638,97 @@ are still some caveats to beware of
QEMU Specific Idioms
********************
+QEMU Object Model Declarations
+==============================
+
+The QEMU Object Model (QOM) provides a framework for handling objects
+in the base C language. The first declaration of a storage or class
+structure should always be the parent, followed by a blank line to
+separate it visually from the new code. It is also useful to separate
+backing for properties (options driven by the user) and internal state
+to make navigation easier.
+
+For a storage structure the first declaration should always be called
+"parent_obj" and for a class structure the first member should always
+be called "parent_class" as below:
+
+.. code-block:: c
+
+ struct MyDeviceState {
+ DeviceState parent_obj;
+
+ /* Properties */
+ int prop_a;
+ char *prop_b;
+ /* Other stuff */
+ int internal_state;
+ };
+
+ struct MyDeviceClass {
+ DeviceClass parent_class;
+
+ void (*new_fn1)(void);
+ bool (*new_fn2)(CPUState *);
+ };
+
+Note that there is no need to provide typedefs for QOM structures
+since these are generated automatically by the QOM declaration macros.
+See :ref:`qom` for more details.
+
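The reason the parent must come first can be sketched in plain C: a pointer to a structure also points at its first member, so a child can be passed wherever the parent type is expected. ``ParentState`` and ``ChildState`` below are hypothetical stand-ins and do not use the real QOM macros.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a QOM parent and child; not real QEMU types. */
struct ParentState {
    int id;
};

struct ChildState {
    struct ParentState parent_obj;

    int internal_state;
};

/* The parent must sit at offset 0 for the cast below to be valid. */
static_assert(offsetof(struct ChildState, parent_obj) == 0,
              "parent_obj must be the first member");

static int parent_id(struct ParentState *p)
{
    return p->id;
}

static int child_parent_id(struct ChildState *c)
{
    /* A pointer to a struct, suitably converted, points at its first member. */
    return parent_id((struct ParentState *)c);
}
```
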
+QEMU GUARD macros
+=================
+
+QEMU provides a number of ``_GUARD`` macros intended to make the
+handling of multiple exit paths easier. For example using
+``QEMU_LOCK_GUARD`` to take a lock will ensure the lock is released on
+exit from the function.
+
+.. code-block:: c
+
+ static int my_critical_function(SomeState *s, void *data)
+ {
+ QEMU_LOCK_GUARD(&s->lock);
+ do_thing1(data);
+ if (check_state2(data)) {
+ return -1;
+ }
+ do_thing3(data);
+ return 0;
+ }
+
+will ensure s->lock is released however the function is exited. The
+equivalent code without the _GUARD macro requires us to carefully call
+qemu_mutex_unlock() at every exit point:
+
+.. code-block:: c
+
+ static int my_critical_function(SomeState *s, void *data)
+ {
+ qemu_mutex_lock(&s->lock);
+ do_thing1(data);
+ if (check_state2(data)) {
+ qemu_mutex_unlock(&s->lock);
+ return -1;
+ }
+ do_thing3(data);
+ qemu_mutex_unlock(&s->lock);
+ return 0;
+ }
+
+There are often ``WITH_`` forms of macros which more easily wrap
+around a block inside a function.
+
+.. code-block:: c
+
+ WITH_RCU_READ_LOCK_GUARD() {
+ QTAILQ_FOREACH_RCU(kid, &bus->children, sibling) {
+ err = do_the_thing(kid->child);
+ if (err < 0) {
+ return err;
+ }
+ }
+ }
+
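A minimal sketch of how such a guard can be built on the compiler cleanup extension described under "Automatic memory deallocation" is given below. It uses a hypothetical toy lock (``ToyLock``, ``TOY_LOCK_GUARD``) rather than QEMU's real mutex types, and QEMU's own macro definitions differ.

```c
#include <assert.h>

/* Hypothetical toy lock so the sketch stays self-contained. */
struct ToyLock {
    int held;
};

static struct ToyLock *toy_lock(struct ToyLock *l)
{
    l->held = 1;
    return l;
}

/* Cleanup handler: receives a pointer to the guard variable itself. */
static void toy_unlock_cleanup(struct ToyLock **l)
{
    (*l)->held = 0;
}

/* Take the lock now; release it when the guard variable leaves scope. */
#define TOY_LOCK_GUARD(l) \
    struct ToyLock *toy_guard_ \
        __attribute__((cleanup(toy_unlock_cleanup))) = toy_lock(l)

static int critical(struct ToyLock *l, int fail)
{
    TOY_LOCK_GUARD(l);
    assert(l->held);    /* the lock is held for the whole body */
    if (fail) {
        return -1;      /* toy_unlock_cleanup() runs on this path */
    }
    return 0;           /* and on this one */
}
```

Both return paths release the lock without an explicit unlock call, which is the property the ``_GUARD`` macros provide.
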
Error handling and reporting
============================
@@ -686,7 +820,7 @@ Rationale: hex numbers are hard to read in logs when there is no 0x prefix,
especially when (occasionally) the representation doesn't contain any letters
and especially in one line with other decimal numbers. Number groups are allowed
to not use '0x' because for some things notations like %x.%x.%x are used not
-only in Qemu. Also dumping raw data bytes with '0x' is less readable.
+only in QEMU. Also dumping raw data bytes with '0x' is less readable.
'#' printf flag
---------------
diff --git a/docs/devel/submitting-a-patch.rst b/docs/devel/submitting-a-patch.rst
new file mode 100644
index 0000000000..83e9092b8c
--- /dev/null
+++ b/docs/devel/submitting-a-patch.rst
@@ -0,0 +1,593 @@
+.. _submitting-a-patch:
+
+Submitting a Patch
+==================
+
+QEMU welcomes contributions to fix bugs, add functionality or improve
+the documentation. However, we get a lot of patches, and so we have
+some guidelines about submitting them. If you follow these, you'll
+help make our task of contribution review easier and your change is
+likely to be accepted and committed faster.
+
+This page seems very long, so if you are only trying to post a quick
+one-shot fix, the bare minimum we ask is that:
+
+.. list-table:: Minimal Checklist for Patches
+ :widths: 35 65
+ :header-rows: 1
+
+ * - Check
+ - Reason
+ * - Patches contain Signed-off-by: Real Name <author@email>
+ - States you are legally able to contribute the code. See :ref:`patch_emails_must_include_a_signed_off_by_line`
+ * - Sent as patch emails to ``qemu-devel@nongnu.org``
+ - The project uses an email list based workflow. See :ref:`submitting_your_patches`
+ * - Be prepared to respond to review comments
+ - Code that doesn't pass review will not get merged. See :ref:`participating_in_code_review`
+
+You do not have to subscribe to post (list policy is to reply-to-all to
+preserve CCs and keep non-subscribers in the loop on the threads they
+start), although you may find it easier as a subscriber to pick up good
+ideas from other posts. If you do subscribe, be prepared for a high
+volume of email, often over one thousand messages in a week. The list is
+moderated; first-time posts from an email address (whether or not you
+subscribed) may be subject to some delay while waiting for a moderator
+to allow your address.
+
+The larger your contribution is, or the more you plan on becoming a
+long-term contributor, the more important the rest of this page becomes.
+Reading the table of contents below should already give you an idea of
+the basic requirements. Use the table of contents as a reference, and
+read the parts that you have doubts about.
+
+.. contents:: Table of Contents
+
+.. _writing_your_patches:
+
+Writing your Patches
+--------------------
+
+.. _use_the_qemu_coding_style:
+
+Use the QEMU coding style
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can run *scripts/checkpatch.pl <patchfile>* before submitting to
+check that you are in compliance with our coding standards. Be aware
+that ``checkpatch.pl`` is not infallible, though, especially where C
+preprocessor macros are involved; use some common sense too. See also:
+
+- :ref:`coding-style`
+- `Automate a checkpatch run on
+ commit <https://blog.vmsplice.net/2011/03/how-to-automatically-run-checkpatchpl.html>`__
+
+.. _base_patches_against_current_git_master:
+
+Base patches against current git master
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There's no point submitting a patch which is based on a released version
+of QEMU because development will have moved on from then and it probably
+won't even apply to master. We only apply selected bugfixes to release
+branches and then only as backports once the code has gone into master.
+
+It is also okay to base patches on top of other on-going work that is
+not yet part of the git master branch. To aid continuous integration
+tools, such as `patchew <http://patchew.org/QEMU/>`__, you should `add a
+tag <https://lists.gnu.org/archive/html/qemu-devel/2017-08/msg01288.html>`__
+line ``Based-on: $MESSAGE_ID`` to your cover letter to make the series
+dependency obvious.
+
+.. _split_up_long_patches:
+
+Split up long patches
+~~~~~~~~~~~~~~~~~~~~~
+
+Split up longer patches into a patch series of logical code changes.
+Each change should compile and execute successfully. For instance, don't
+add a file to the makefile in patch one and then add the file itself in
+patch two. (This rule is here so that people can later use tools like
+`git bisect <http://git-scm.com/docs/git-bisect>`__ without hitting
+points in the commit history where QEMU doesn't work for reasons
+unrelated to the bug they're chasing.) Put documentation first, not
+last, so that someone reading the series can do a clean-room evaluation
+of the documentation, then validate that the code matched the
+documentation. A commit message that mentions "Also, ..." is often a
+good candidate for splitting into multiple patches. For more thoughts on
+properly splitting patches and writing good commit messages, see `this
+advice from
+OpenStack <https://wiki.openstack.org/wiki/GitCommitMessages>`__.
+
+.. _make_code_motion_patches_easy_to_review:
+
+Make code motion patches easy to review
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a series requires large blocks of code motion, there are tricks for
+making the refactoring easier to review. Split up the series so that
+semantic changes (or even function renames) are done in a separate patch
+from the raw code motion. Use a one-time setup of ``git config
+diff.renames true;`` ``git config diff.algorithm patience`` (refer to
+`git-config <http://git-scm.com/docs/git-config>`__). The 'diff.renames'
+property ensures file rename patches will be given in a more compact
+representation that focuses only on the differences across the file
+rename, instead of showing the entire old file as a deletion and the new
+file as an insertion. Meanwhile, the 'diff.algorithm' property ensures
+that extracting a non-contiguous subset of one file into a new file, but
+where all extracted parts occur in the same order both before and after
+the patch, will reduce churn in trying to treat unrelated ``}`` lines in
+the original file as separating hunks of changes.
+
+Ideally, a code motion patch can be reviewed by doing::
+
+ git format-patch --stdout -1 > patch;
+ diff -u <(sed -n 's/^-//p' patch) <(sed -n 's/^\+//p' patch)
+
+to focus on the few changes that weren't wholesale code motion.
+
+.. _dont_include_irrelevant_changes:
+
+Don't include irrelevant changes
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In particular, don't include formatting, coding style or whitespace
+changes to bits of code that would otherwise not be touched by the
+patch. (It's OK to fix coding style issues in the immediate area (few
+lines) of the lines you're changing.) If you think a section of code
+really does need a reindent or other large-scale style fix, submit this
+as a separate patch which makes no semantic changes; don't put it in the
+same patch as your bug fix.
+
+For smaller patches in less frequently changed areas of QEMU, consider
+using the :ref:`trivial-patches` process.
+
+.. _write_a_meaningful_commit_message:
+
+Write a meaningful commit message
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Commit messages should be meaningful and should stand on their own as a
+historical record of why the changes you applied were necessary or
+useful.
+
+QEMU follows the usual standard for git commit messages: the first line
+(which becomes the email subject line) is "subsystem: single line
+summary of change". Whether the "single line summary of change" starts
+with a capital is a matter of taste, but we prefer that the summary does
+not end in a dot. Look at ``git shortlog -30`` for an idea of sample
+subject lines. Then there is a blank line and a more detailed
+description of the patch, another blank and your Signed-off-by: line.
+Please do not use lines that are longer than 76 characters in your
+commit message (so that the text still shows up nicely with "git show"
+in an 80-column terminal window).
+
+The body of the commit message is a good place to document why your
+change is important. Don't include comments like "This is a suggestion
+for fixing this bug" (they can go below the ``---`` line in the email so
+they don't go into the final commit message). Make sure the body of the
+commit message can be read in isolation even if the reader's mailer
+displays the subject line some distance apart (that is, a body that
+starts with "... so that" as a continuation of the subject line is
+harder to follow).
+
+If your patch fixes a commit that is already in the repository, please
+add an additional line with "Fixes: <at-least-12-digits-of-SHA-commit-id>
+("Fixed commit subject")" below the patch description / before your
+"Signed-off-by:" line in the commit message.
+
+If your patch fixes a bug in the gitlab bug tracker, please add a line
+with "Resolves: <URL-of-the-bug>" to the commit message, too. Gitlab can
+close bugs automatically once commits with the "Resolves:" keyword get
+merged into the master branch of the project. And if your patch addresses
+a bug in another public bug tracker, you can also use a line with
+"Buglink: <URL-of-the-bug>" for reference here, too.
+
+Example::
+
+ Fixes: 14055ce53c2d ("s390x/tcg: avoid overflows in time2tod/tod2time")
+ Resolves: https://gitlab.com/qemu-project/qemu/-/issues/42
+ Buglink: https://bugs.launchpad.net/qemu/+bug/1804323
+
+Some other tags that are used in commit messages include "Message-Id:",
+"Tested-by:", "Acked-by:", "Reported-by:", "Suggested-by:". See ``git
+log`` for these keywords for example usage.
+
+.. _test_your_patches:
+
+Test your patches
+~~~~~~~~~~~~~~~~~
+
+Although QEMU uses various :ref:`ci` services that attempt to test
+patches submitted to the list, it still saves everyone time if you
+have already tested that your patch compiles and works. Because QEMU
+is such a large project the default configuration won't create a
+testing pipeline on GitLab when a branch is pushed. See the :ref:`CI
+variable documentation<ci_var>` for details on how to control the
+running of tests; but it is still wise to also check that your patches
+work with a full build before submitting a series, especially if your
+changes might have an unintended effect on other areas of the code you
+don't normally experiment with. See :ref:`testing` for more details on
+what tests are available.
+
+Also, it is a wise idea to include a testsuite addition as part of
+your patches - either to ensure that future changes won't regress your
+new feature, or to add a test which exposes the bug that the rest of
+your series fixes. Keeping separate commits for the test and the fix
+allows reviewers to rebase the test to occur first to prove it catches
+the problem, then again to place it last in the series so that
+bisection doesn't land on a known-broken state.
+
+.. _submitting_your_patches:
+
+Submitting your Patches
+-----------------------
+
+The QEMU project uses a public email based workflow for reviewing and
+merging patches. As a result all contributions to QEMU must be **sent
+as patches** to the qemu-devel `mailing list
+<https://wiki.qemu.org/Contribute/MailingLists>`__. Patch
+contributions should not be posted on the bug tracker, posted on
+forums, or externally hosted and linked to. (We have other mailing
+lists too, but all patches must go to qemu-devel, possibly with a Cc:
+to another list.) ``git send-email`` (`step-by-step setup guide
+<https://git-send-email.io/>`__ and `hints and tips
+<https://elixir.bootlin.com/linux/latest/source/Documentation/process/email-clients.rst>`__)
+works best for delivering the patch without mangling it, but
+attachments can be used as a last resort on a first-time submission.
+
+.. _if_you_cannot_send_patch_emails:
+
+If you cannot send patch emails
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In rare cases it may not be possible to send properly formatted patch
+emails. You can use `sourcehut <https://sourcehut.org/>`__ to send your
+patches to the QEMU mailing list by following these steps:
+
+#. Register or sign in to your account
+#. Add your SSH public key in `meta \|
+ keys <https://meta.sr.ht/keys>`__.
+#. Publish your git branch using **git push git@git.sr.ht:~USERNAME/qemu
+ HEAD**
+#. Send your patches to the QEMU mailing list using the web-based
+ ``git-send-email`` UI at https://git.sr.ht/~USERNAME/qemu/send-email
+
+`This video
+<https://spacepub.space/videos/watch/ad258d23-0ac6-488c-83fc-2bacf578de3a>`__
+shows the web-based ``git-send-email`` workflow. Documentation is
+available `here
+<https://man.sr.ht/git.sr.ht/#sending-patches-upstream>`__.
+
+.. _cc_the_relevant_maintainer:
+
+CC the relevant maintainer
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send patches both to the mailing list and CC the maintainer(s) of the
+files you are modifying. Look in the MAINTAINERS file to find out who
+that is. Also try using scripts/get_maintainer.pl from the repository
+for learning the most common committers for the files you touched.
+
+Example::
+
+ ~/src/qemu/scripts/get_maintainer.pl -f hw/ide/core.c
+
+In fact, you can automate this, via a one-time setup of ``git config
+sendemail.cccmd 'scripts/get_maintainer.pl --nogit-fallback'`` (Refer to
+`git-config <http://git-scm.com/docs/git-config>`__.)
+
+.. _do_not_send_as_an_attachment:
+
+Do not send as an attachment
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Send patches inline so they are easy to reply to with review comments.
+Do not put patches in attachments.
+
+.. _use_git_format_patch:
+
+Use ``git format-patch``
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+Use the right diff format.
+`git format-patch <http://git-scm.com/docs/git-format-patch>`__ will
+produce patch emails in the right format (check the documentation to
+find out how to drive it). You can then edit the cover letter before
+using ``git send-email`` to mail the files to the mailing list. (We
+recommend `git send-email <http://git-scm.com/docs/git-send-email>`__
+because mail clients often mangle patches by wrapping long lines or
+messing up whitespace. Some distributions do not include send-email in a
+default install of git; you may need to download additional packages,
+such as 'git-email' on Fedora-based systems.) Patch series need a cover
+letter, with shallow threading (all patches in the series are
+in-reply-to the cover letter, but not to each other); single unrelated
+patches do not need a cover letter (but if you do send a cover letter,
+use ``--numbered`` so the cover and the patch have distinct subject lines).
+Patches are easier to find if they start a new top-level thread, rather
+than being buried in-reply-to another existing thread.
+
+.. _avoid_posting_large_binary_blob:
+
+Avoid posting large binary blob
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you added binaries to the repository, consider producing the patch
+emails using ``git format-patch --no-binary`` and include a link to a
+git repository to fetch the original commit.
+
+.. _patch_emails_must_include_a_signed_off_by_line:
+
+Patch emails must include a ``Signed-off-by:`` line
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Your patches **must** include a Signed-off-by: line. This is a hard
+requirement because it's how you say "I'm legally okay to contribute
+this and happy for it to go into QEMU". The process is modelled after
+the `Linux kernel
+<http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__
+policy.
+
+If you wrote the patch, make sure your "From:" and "Signed-off-by:"
+lines use the same spelling. It's okay if you subscribe or contribute to
+the list via more than one address, but using multiple addresses in one
+commit just confuses things. If someone else wrote the patch, git will
+include a "From:" line in the body of the email (different from your
+envelope From:) that will give credit to the correct author; but again,
+that author's Signed-off-by: line is mandatory, with the same spelling.
+
+Various tooling options can add these tags automatically,
+including ``git commit -s`` and ``git format-patch -s``. For more
+information see `SubmittingPatches 1.12
+<http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/SubmittingPatches?id=f6f94e2ab1b33f0082ac22d71f66385a60d8157f#n297>`__.
+
+.. _include_a_meaningful_cover_letter:
+
+Include a meaningful cover letter
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This is a requirement for any series with multiple patches (as it aids
+continuous integration), but optional for an isolated patch. The cover
+letter explains the overall goal of such a series, and also provides a
+convenient 0/N email for others to reply to the series as a whole. A
+one-time setup of ``git config format.coverletter auto`` (refer to
+`git-config <http://git-scm.com/docs/git-config>`__) will generate the
+cover letter as needed.
+
+When reviewers don't know your goal at the start of their review, they
+may object to early changes that don't make sense until the end of the
+series, because they do not have enough context yet at that point of
+their review. A series where the goal is unclear also risks a higher
+number of review-fix cycles because the reviewers haven't bought into
+the idea yet. If the cover letter can explain these points to the
+reviewer, the process will be smoother and patches will get merged faster.
+Make sure your cover letter includes a diffstat of changes made over the
+entire series; potential reviewers know what files they are interested
+in, and they need an easy way to determine if your series touches them.
+
+.. _use_the_rfc_tag_if_needed:
+
+Use the RFC tag if needed
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For example, "[PATCH RFC v2]". ``git format-patch --subject-prefix=RFC``
+can help.
+
+"RFC" means "Request For Comments" and is a statement that you don't
+intend for your patchset to be applied to master, but would like some
+review on it anyway. Reasons for doing this include:
+
+- the patch depends on some pending kernel changes which haven't yet
+ been accepted, so the QEMU patch series is blocked until that
+ dependency has been dealt with, but is worth reviewing anyway
+- the patch set is not finished yet (perhaps it doesn't cover all use
+ cases or work with all targets) but you want early review of a major
+ API change or design structure before continuing
+
+In general, since it's asking other people to do review work on a
+patchset that the submitter themselves is saying shouldn't be applied,
+it's best to:
+
+- use it sparingly
+- in the cover letter, be clear about why a patch is an RFC, what areas
+ of the patchset you're looking for review on, and why reviewers
+ should care
+
+.. _consider_whether_your_patch_is_applicable_for_stable:
+
+Consider whether your patch is applicable for stable
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If your patch fixes a severe issue or a regression, it may be applicable
+for stable. In that case, consider adding ``Cc: qemu-stable@nongnu.org``
+to your patch to notify the stable maintainers.
+
+For more details on how QEMU's stable process works, refer to the
+:ref:`stable-process` page.
+
+.. _participating_in_code_review:
+
+Participating in Code Review
+----------------------------
+
+All patches submitted to the QEMU project go through a code review
+process before they are accepted. This will often mean a series will
+go through a number of iterations before being picked up by
+:ref:`maintainers<maintainers>`. You therefore should be prepared to
+read replies to your messages and be willing to act on them.
+
+Maintainers are often willing to manually fix up first-time
+contributions, since there is a learning curve involved in making an
+ideal patch submission. However for the best results you should
+proactively respond to suggestions with changes or justifications for
+your current approach.
+
+Well-maintained areas of code may see patches reviewed quickly;
+lesser-loved areas of code may have a longer delay.
+
+.. _stay_around_to_fix_problems_raised_in_code_review:
+
+Stay around to fix problems raised in code review
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Not many patches get into QEMU straight away -- it is quite common that
+developers will identify bugs, or suggest a cleaner approach, or even
+just point out code style issues or commit message typos. You'll need to
+respond to these, and then send a second version of your patches with
+the issues fixed. This takes a little time and effort on your part, but
+if you don't do it then your changes will never get into QEMU.
+
+Remember that a maintainer is under no obligation to take your
+patches. If someone has spent the time reviewing your code and
+suggesting improvements and you simply re-post without either
+addressing the comment directly or providing additional justification
+for the change then it becomes wasted effort. You cannot demand others
+merge and then fix up your code after the fact.
+
+When replying to comments on your patches **reply to all and not just
+the sender** -- keeping discussion on the mailing list means everybody
+can follow it. Remember the spirit of the :ref:`code_of_conduct` and
+keep discussions respectful and collaborative and avoid making
+personal comments.
+
+.. _pay_attention_to_review_comments:
+
+Pay attention to review comments
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Someone took their time to review your work, and it pays to respect that
+effort; repeatedly submitting a series without addressing all comments
+from the previous round tends to alienate reviewers and stall your
+patch. Reviewers aren't always perfect, so it is okay if you want to
+argue that your code was correct in the first place instead of blindly
+doing everything the reviewer asked. On the other hand, if someone
+pointed out a potential issue during review, then even if your code
+turns out to be correct, it's probably a sign that you should improve
+your commit message and/or comments in the code explaining why the code
+is correct.
+
+If you fix issues that are raised during review **resend the entire
+patch series** not just the one patch that was changed. This allows
+maintainers to easily apply the fixed series without having to manually
+identify which patches are relevant. Send the new version as a complete
+fresh email or series of emails -- don't try to make it a followup to
+version 1. (This helps automatic patch email handling tools distinguish
+between v1 and v2 emails.)
+
+.. _when_resending_patches_add_a_version_tag:
+
+When resending patches add a version tag
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+All patches beyond the first version should include a version tag -- for
+example, "[PATCH v2]". This means people can easily identify whether
+they're looking at the most recent version. (The first version of a
+patch need not say "v1", just [PATCH] is sufficient.) For patch series,
+the version applies to the whole series -- even if you only change one
+patch, you resend the entire series and mark it as "v2". Don't try to
+track versions of different patches in the series separately. `git
+format-patch <http://git-scm.com/docs/git-format-patch>`__ and `git
+send-email <http://git-scm.com/docs/git-send-email>`__ both understand
+the ``-v2`` option to make this easier. Send each new revision as a new
+top-level thread, rather than burying it in-reply-to an earlier
+revision, as many reviewers are not looking inside deep threads for new
+patches.
+
+.. _include_version_history_in_patchset_revisions:
+
+Include version history in patchset revisions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For later versions of patches, include a summary of changes from
+previous versions, but not in the commit message itself. In an email
+formatted as a git patch, the commit message is the part above the ``---``
+line, and this will go into the git changelog when the patch is
+committed. This part should be a self-contained description of what this
+version of the patch does, written to make sense to anybody who comes
+back to look at this commit in git in six months' time. The part below
+the ``---`` line and above the patch proper (git format-patch puts the
+diffstat here) is a good place to put remarks for people reading the
+patch email, and this is where the "changes since previous version"
+summary belongs. The `git-publish
+<https://github.com/stefanha/git-publish>`__ script can help with
+tracking a good summary across versions. Also, the `git-backport-diff
+<https://github.com/codyprime/git-scripts>`__ script can help focus
+reviewers on what changed between revisions.
+
+.. _tips_and_tricks:
+
+Tips and Tricks
+---------------
+
+.. _proper_use_of_reviewed_by_tags_can_aid_review:
+
+Proper use of Reviewed-by: tags can aid review
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+When reviewing a large series, a reviewer can reply to some of the
+patches with a Reviewed-by tag, stating that they are happy with that
+patch in isolation (sometimes conditional on minor cleanup, like fixing
+whitespace, that doesn't affect code content). You should then update
+those commit messages by hand to include the Reviewed-by tag, so that in
+the next revision, reviewers can spot which patches were already clean
+from the previous round. Conversely, if you significantly modify a patch
+that was previously reviewed, remove the Reviewed-by tag from the
+commit message and list the changes from the previous
+version, to make it easier to focus a reviewer's attention on your
+changes.
+
+.. _if_your_patch_seems_to_have_been_ignored:
+
+If your patch seems to have been ignored
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If your patchset has received no replies, you should "ping" it after a
+week or two, by sending an email as a reply-to-all to the patch mail,
+including the word "ping" and ideally also a link to the page for the
+patch on `patchew <https://patchew.org/QEMU/>`__ or
+`lore.kernel.org <https://lore.kernel.org/qemu-devel/>`__. It's worth
+double-checking for reasons why your patch might have been ignored
+(forgot to CC the maintainer? annoyed people by failing to respond to
+review comments on an earlier version?), but often, for less-maintained
+areas of QEMU, patches do just slip through the cracks. If your ping is
+also ignored, ping again after another week or so. As the submitter, you
+are the person with the most motivation to get your patch applied, so
+you have to be persistent.
+
+.. _is_my_patch_in:
+
+Is my patch in?
+~~~~~~~~~~~~~~~
+
+QEMU has some Continuous Integration machines that try to catch patch
+submission problems as soon as possible. `patchew
+<https://patchew.org/QEMU/>`__ includes a web interface for tracking the
+status of various threads that have been posted to the list, and may
+send you an automated mail if it detects a problem with your patch.
+
+Once your patch has had enough review on list, the maintainer for that
+area of code will send notification to the list that they are including
+your patch in a particular staging branch. Periodically, the maintainer
+then takes care of :ref:`submitting-a-pull-request`
+for aggregating topic branches into mainline QEMU. Generally, you do not
+need to send a pull request unless you have contributed enough patches
+to become a maintainer over a particular section of code. Maintainers
+may further modify your commit, by resolving simple merge conflicts or
+fixing minor typos pointed out during review, but will always add a
+Signed-off-by line in addition to yours, indicating that it went through
+their tree. Occasionally, the maintainer's pull request may hit more
+difficult merge conflicts, where you may be requested to help rebase and
+resolve the problems. It may take a couple of weeks between when your
+patch first had a positive review to when it finally lands in qemu.git;
+release cycle freezes may extend that time even longer.
+
+.. _return_the_favor:
+
+Return the favor
+~~~~~~~~~~~~~~~~
+
+Peer review only works if everyone chips in a bit of review time. If
+everyone submitted more patches than they reviewed, we would have a
+patch backlog. A good goal is to try to review at least as many patches
+from others as you submit. Don't worry if you don't know the code
+base as well as a maintainer; it's perfectly fine to admit when your
+review is weak because you are unfamiliar with the code.
diff --git a/docs/devel/submitting-a-pull-request.rst b/docs/devel/submitting-a-pull-request.rst
new file mode 100644
index 0000000000..a4cd7ebbb6
--- /dev/null
+++ b/docs/devel/submitting-a-pull-request.rst
@@ -0,0 +1,73 @@
+.. _submitting-a-pull-request:
+
+Submitting a Pull Request
+=========================
+
+QEMU welcomes contributions of code, but we generally expect these to be
+sent as simple patch emails to the mailing list (see our page on
+:ref:`submitting-a-patch`
+for more details). Generally only existing submaintainers of a tree
+will need to submit pull requests, although occasionally for a large
+patch series we might ask a submitter to send a pull request. This page
+documents our recommendations on pull requests for those people.
+
+A good rule of thumb is not to send a pull request unless somebody asks
+you to.
+
+**Resend the patches with the pull request** as emails which are
+threaded as follow-ups to the pull request itself. The simplest way to
+do this is to use ``git format-patch --cover-letter`` to create the
+emails, and then edit the cover letter to include the pull request
+details that ``git request-pull`` outputs.
+
+**Use PULL as the subject line tag** in both the cover letter and the
+retransmitted patch mails (for example, by using
+``--subject-prefix=PULL`` in your ``git format-patch`` command). This
+helps people to filter in or out the resulting emails (especially useful
+if they are only CC'd on one email out of the set).
+
+**Each patch must have your own Signed-off-by: line** as well as that of
+the original author if the patch was not written by you. This is because
+with a pull request you're now indicating that the patch has passed via
+you rather than directly from the original author.
+
+**Don't forget to add Reviewed-by: and Acked-by: lines**. When other
+people have reviewed the patches you're putting in the pull request,
+make sure you've copied their signoffs across. (If you use the `patches
+tool <https://github.com/stefanha/patches>`__ to add patches from email
+directly to your git repo it will include the tags automatically; if
+you're updating patches manually or in some other way you'll need to
+edit the commit messages by hand.)
+
+**Don't send pull requests for code that hasn't passed review**. A pull
+request says these patches are ready to go into QEMU now, so they must
+have passed the standard code review processes. In particular if you've
+corrected issues in one round of code review, you need to send your
+fixed patch series as normal to the list; you can't put it in a pull
+request until it's gone through. (Extremely trivial fixes may be OK to
+just fix in passing, but if in doubt err on the side of not.)
+
+**Test before sending**. This is an obvious thing to say, but make sure
+everything builds (including that it compiles at each step of the patch
+series) and that "make check" passes before sending out the pull
+request. As a submaintainer you're one of QEMU's lines of defense
+against bad code, so double check the details.
+
+**All pull requests must be signed**. By "signed" here we mean that
+the pullreq email should quote a tag which is a GPG-signed tag (as
+created with ``git tag -s ...``). See :ref:`maintainer_keys` for
+details.
+
+**Pull requests not for master should say "not for master" and have
+"PULL SUBSYSTEM whatever" in the subject tag**. If your pull request is
+targeting a stable branch or some submaintainer tree, please include the
+string "not for master" in the cover letter email, and make sure the
+subject tag is "PULL SUBSYSTEM s390/block/whatever" rather than just
+"PULL". This allows it to be automatically filtered out of the set of
+pull requests that should be applied to master.
+
+You might be interested in the `make-pullreq
+<https://git.linaro.org/people/peter.maydell/misc-scripts.git/tree/make-pullreq>`__
+script which automates some of this process for you and includes a few
+sanity checks. Note that you must edit it to configure it suitably for
+your local situation!
diff --git a/docs/devel/tcg-icount.rst b/docs/devel/tcg-icount.rst
index 50c8e8dabc..7df883446a 100644
--- a/docs/devel/tcg-icount.rst
+++ b/docs/devel/tcg-icount.rst
@@ -62,12 +62,6 @@ To deal with this case, when an I/O access is made we:
- re-compile a single [1]_ instruction block for the current PC
- exit the cpu loop and execute the re-compiled block
-The new block is created with the CF_LAST_IO compile flag which
-ensures the final instruction translation starts with a call to
-gen_io_start() so we don't enter a perpetual loop constantly
-recompiling a single instruction block. For translators using the
-common translator_loop this is done automatically.
-
.. [1] sometimes two instructions if dealing with delay slots
Other I/O operations
diff --git a/docs/devel/tcg-ops.rst b/docs/devel/tcg-ops.rst
new file mode 100644
index 0000000000..d46b625e0e
--- /dev/null
+++ b/docs/devel/tcg-ops.rst
@@ -0,0 +1,979 @@
+.. _tcg-ops-ref:
+
+*******************************
+TCG Intermediate Representation
+*******************************
+
+Introduction
+============
+
+TCG (Tiny Code Generator) began as a generic backend for a C compiler.
+It was simplified to be used in QEMU. It also has its roots in the
+QOP code generator written by Paul Brook.
+
+Definitions
+===========
+
+The TCG *target* is the architecture for which we generate the code.
+It is of course not the same as the "target" of QEMU which is the
+emulated architecture. As TCG started as a generic C backend used
+for cross compiling, the assumption was that TCG target might be
+different from the host, although this is never the case for QEMU.
+
+In this document, we use *guest* to specify what architecture we are
+emulating; *target* always means the TCG target, the machine on which
+we are running QEMU.
+
+An operation with *undefined behavior* may result in a crash.
+
+An operation with *unspecified behavior* shall not crash. However,
+the result may be one of several possibilities so may be considered
+an *undefined result*.
+
+Basic Blocks
+============
+
+A TCG *basic block* is a single entry, multiple exit region which
+corresponds to a list of instructions terminated by a label, or
+any branch instruction.
+
+A TCG *extended basic block* is a single entry, multiple exit region
+which corresponds to a list of instructions terminated by a label or
+an unconditional branch. Specifically, an extended basic block is
+a sequence of basic blocks connected by the fall-through paths of
+zero or more conditional branch instructions.
+
+Operations
+==========
+
+TCG instructions or *ops* operate on TCG *variables*, both of which
+are strongly typed. Each instruction has a fixed number of output
+variable operands, input variable operands and constant operands.
+Vector instructions have a field specifying the element size within
+the vector. The notable exception is the call instruction which has
+a variable number of outputs and inputs.
+
+In the textual form, output operands usually come first, followed by
+input operands, followed by constant operands. The output type is
+included in the instruction name. Constants are prefixed with a '$'.
+
+.. code-block:: none
+
+ add_i32 t0, t1, t2 /* (t0 <- t1 + t2) */
+
+Variables
+=========
+
+* ``TEMP_FIXED``
+
+ There is one TCG *fixed global* variable, ``cpu_env``, which is
+ live in all translation blocks, and holds a pointer to ``CPUArchState``.
+ This variable is held in a host cpu register at all times in all
+ translation blocks.
+
+* ``TEMP_GLOBAL``
+
+ A TCG *global* is a variable which is live in all translation blocks,
+ and corresponds to a memory location that is within ``CPUArchState``.
+ These may be specified as an offset from ``cpu_env``, in which case
+ they are called *direct globals*, or may be specified as an offset
+ from a direct global, in which case they are called *indirect globals*.
+ Even indirect globals should still reference memory within
+ ``CPUArchState``. All TCG globals are defined during
+ ``TCGCPUOps.initialize``, before any translation blocks are generated.
+
+* ``TEMP_CONST``
+
+ A TCG *constant* is a variable which is live throughout the entire
+ translation block, and contains a constant value. These variables
+ are allocated on demand during translation and are hashed so that
+ there is exactly one variable holding a given value.
+
+* ``TEMP_TB``
+
+ A TCG *translation block temporary* is a variable which is live
+ throughout the entire translation block, but dies on any exit.
+ These temporaries are allocated explicitly during translation.
+
+* ``TEMP_EBB``
+
+ A TCG *extended basic block temporary* is a variable which is live
+ throughout an extended basic block, but dies on any exit.
+ These temporaries are allocated explicitly during translation.
+
+Types
+=====
+
+* ``TCG_TYPE_I32``
+
+ A 32-bit integer.
+
+* ``TCG_TYPE_I64``
+
+ A 64-bit integer. For 32-bit hosts, such variables are split into a pair
+ of variables with ``type=TCG_TYPE_I32`` and ``base_type=TCG_TYPE_I64``.
+ The ``temp_subindex`` for each indicates where it falls within the
+ host-endian representation.
+
+* ``TCG_TYPE_PTR``
+
+ An alias for ``TCG_TYPE_I32`` or ``TCG_TYPE_I64``, depending on the size
+ of a pointer for the host.
+
+* ``TCG_TYPE_REG``
+
+ An alias for ``TCG_TYPE_I32`` or ``TCG_TYPE_I64``, depending on the size
+ of the integer registers for the host. This may be larger
+ than ``TCG_TYPE_PTR`` depending on the host ABI.
+
+* ``TCG_TYPE_I128``
+
+ A 128-bit integer. For all hosts, such variables are split into a number
+ of variables with ``type=TCG_TYPE_REG`` and ``base_type=TCG_TYPE_I128``.
+ The ``temp_subindex`` for each indicates where it falls within the
+ host-endian representation.
+
+* ``TCG_TYPE_V64``
+
+ A 64-bit vector. This type is valid only if the TCG target
+ sets ``TCG_TARGET_HAS_v64``.
+
+* ``TCG_TYPE_V128``
+
+ A 128-bit vector. This type is valid only if the TCG target
+ sets ``TCG_TARGET_HAS_v128``.
+
+* ``TCG_TYPE_V256``
+
+ A 256-bit vector. This type is valid only if the TCG target
+ sets ``TCG_TARGET_HAS_v256``.
+
+Helpers
+=======
+
+Helpers are registered in a guest-specific ``helper.h``,
+which is processed to generate ``tcg_gen_helper_*`` functions.
+With these functions it is possible to call a function taking
+i32, i64, i128 or pointer types.
+
+By default, before calling a helper, all globals are stored at their
+canonical location. Also by default, the helper is allowed to modify the
+CPU state (including the state represented by TCG globals)
+and to raise an exception. These defaults can be overridden using the
+following function modifiers:
+
+* ``TCG_CALL_NO_WRITE_GLOBALS``
+
+ The helper does not modify any globals, but may read them.
+ Globals will be saved to their canonical location before calling helpers,
+ but need not be reloaded afterwards.
+
+* ``TCG_CALL_NO_READ_GLOBALS``
+
+ The helper does not read globals, either directly or via an exception.
+ They will not be saved to their canonical locations before calling
+ the helper. This implies ``TCG_CALL_NO_WRITE_GLOBALS``.
+
+* ``TCG_CALL_NO_SIDE_EFFECTS``
+
+ The call to the helper function may be removed if the return value is
+ not used. This means that it may not modify any CPU state nor may it
+ raise an exception.
+
+Code Optimizations
+==================
+
+When generating instructions, you can count on at least the following
+optimizations:
+
+- Single instructions are simplified, e.g.
+
+ .. code-block:: none
+
+ and_i32 t0, t0, $0xffffffff
+
+ is suppressed.
+
+- A liveness analysis is done at the basic block level. The
+ information is used to suppress moves from a dead variable to
+ another one. It is also used to remove instructions which compute
+ dead results. The latter is especially useful for condition code
+ optimization in QEMU.
+
+ In the following example:
+
+ .. code-block:: none
+
+ add_i32 t0, t1, t2
+ add_i32 t0, t0, $1
+ mov_i32 t0, $1
+
+ only the last instruction is kept.
+
+
+Instruction Reference
+=====================
+
+Function call
+-------------
+
+.. list-table::
+
+ * - call *<ret>* *<params>* ptr
+
+ - | call function 'ptr' (pointer type)
+ |
+ | *<ret>* optional 32 bit or 64 bit return value
+ | *<params>* optional 32 bit or 64 bit parameters
+
+Jumps/Labels
+------------
+
+.. list-table::
+
+ * - set_label $label
+
+ - | Define label 'label' at the current program point.
+
+ * - br $label
+
+ - | Jump to label.
+
+ * - brcond_i32/i64 *t0*, *t1*, *cond*, *label*
+
+ - | Conditional jump if *t0* *cond* *t1* is true. *cond* can be:
+ |
+ | ``TCG_COND_EQ``
+ | ``TCG_COND_NE``
+ | ``TCG_COND_LT /* signed */``
+ | ``TCG_COND_GE /* signed */``
+ | ``TCG_COND_LE /* signed */``
+ | ``TCG_COND_GT /* signed */``
+ | ``TCG_COND_LTU /* unsigned */``
+ | ``TCG_COND_GEU /* unsigned */``
+ | ``TCG_COND_LEU /* unsigned */``
+ | ``TCG_COND_GTU /* unsigned */``
+ | ``TCG_COND_TSTEQ /* (t0 & t1) == 0 */``
+ | ``TCG_COND_TSTNE /* (t0 & t1) != 0 */``
+
+Arithmetic
+----------
+
+.. list-table::
+
+ * - add_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* + *t2*
+
+ * - sub_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* - *t2*
+
+ * - neg_i32/i64 *t0*, *t1*
+
+ - | *t0* = -*t1* (two's complement)
+
+ * - mul_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* * *t2*
+
+ * - div_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* / *t2* (signed)
+ | Undefined behavior if division by zero or overflow.
+
+ * - divu_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* / *t2* (unsigned)
+ | Undefined behavior if division by zero.
+
+ * - rem_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* % *t2* (signed)
+ | Undefined behavior if division by zero or overflow.
+
+ * - remu_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* % *t2* (unsigned)
+ | Undefined behavior if division by zero.
+
+
+Logical
+-------
+
+.. list-table::
+
+ * - and_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* & *t2*
+
+ * - or_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* | *t2*
+
+ * - xor_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* ^ *t2*
+
+ * - not_i32/i64 *t0*, *t1*
+
+ - | *t0* = ~\ *t1*
+
+ * - andc_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* & ~\ *t2*
+
+ * - eqv_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = ~(*t1* ^ *t2*), or equivalently, *t0* = *t1* ^ ~\ *t2*
+
+ * - nand_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = ~(*t1* & *t2*)
+
+ * - nor_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = ~(*t1* | *t2*)
+
+ * - orc_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* | ~\ *t2*
+
+ * - clz_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* ? clz(*t1*) : *t2*
+
+ * - ctz_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* ? ctz(*t1*) : *t2*
+
+ * - ctpop_i32/i64 *t0*, *t1*
+
+ - | *t0* = number of bits set in *t1*
+ |
+ | With *ctpop* short for "count population", matching
+ | the function name used in ``include/qemu/host-utils.h``.
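+
+The third operand of clz and ctz supplies the result for a zero input,
+so a front end can pass whatever value the guest architecture defines
+for that case. The following is a minimal C sketch of the 32-bit
+semantics listed above. The function names are hypothetical and the
+loops are chosen for clarity; this is not how any TCG backend
+implements these ops.

```c
#include <stdint.h>

/* Hypothetical model of clz_i32: t0 = t1 ? clz(t1) : t2. */
uint32_t sketch_clz_i32(uint32_t t1, uint32_t t2)
{
    uint32_t n = 0;

    if (t1 == 0) {
        return t2;              /* caller-supplied result for zero input */
    }
    while (!(t1 & 0x80000000u)) {
        t1 <<= 1;               /* shift until the top bit is set */
        n++;
    }
    return n;
}

/* Hypothetical model of ctz_i32: t0 = t1 ? ctz(t1) : t2. */
uint32_t sketch_ctz_i32(uint32_t t1, uint32_t t2)
{
    uint32_t n = 0;

    if (t1 == 0) {
        return t2;
    }
    while (!(t1 & 1u)) {
        t1 >>= 1;               /* shift until the bottom bit is set */
        n++;
    }
    return n;
}
```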
+
+
+Shifts/Rotates
+--------------
+
+.. list-table::
+
+ * - shl_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* << *t2*
+ | Unspecified behavior if *t2* < 0 or *t2* >= 32 (resp 64)
+
+ * - shr_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* >> *t2* (unsigned)
+ | Unspecified behavior if *t2* < 0 or *t2* >= 32 (resp 64)
+
+ * - sar_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* >> *t2* (signed)
+ | Unspecified behavior if *t2* < 0 or *t2* >= 32 (resp 64)
+
+ * - rotl_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* rotated left by *t2* bits.
+ | Unspecified behavior if *t2* < 0 or *t2* >= 32 (resp 64)
+
+ * - rotr_i32/i64 *t0*, *t1*, *t2*
+
+ - | *t0* = *t1* rotated right by *t2* bits.
+ | Unspecified behavior if *t2* < 0 or *t2* >= 32 (resp 64)
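+
+As an illustration of the rotate ops and of the shift-count range in
+which they are defined, here is a small C sketch for the 32-bit case.
+The function names are hypothetical; the ``assert`` marks the range
+outside which TCG leaves the result unspecified.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of rotl_i32: rotate t1 left by t2 bits. */
uint32_t sketch_rotl_i32(uint32_t t1, uint32_t t2)
{
    assert(t2 < 32);
    /* (32 - t2) & 31 keeps the right-shift count in range when t2 == 0. */
    return (t1 << t2) | (t1 >> ((32 - t2) & 31));
}

/* Hypothetical model of rotr_i32: rotate t1 right by t2 bits. */
uint32_t sketch_rotr_i32(uint32_t t1, uint32_t t2)
{
    assert(t2 < 32);
    return (t1 >> t2) | (t1 << ((32 - t2) & 31));
}
```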
+
+
+Misc
+----
+
+.. list-table::
+
+ * - mov_i32/i64 *t0*, *t1*
+
+ - | *t0* = *t1*
+ | Move *t1* to *t0* (both operands must have the same type).
+
+ * - ext8s_i32/i64 *t0*, *t1*
+
+ ext8u_i32/i64 *t0*, *t1*
+
+ ext16s_i32/i64 *t0*, *t1*
+
+ ext16u_i32/i64 *t0*, *t1*
+
+ ext32s_i64 *t0*, *t1*
+
+ ext32u_i64 *t0*, *t1*
+
+ - | 8, 16 or 32 bit sign/zero extension (both operands must have the same type)
+
+ * - bswap16_i32/i64 *t0*, *t1*, *flags*
+
+ - | 16 bit byte swap on the low bits of a 32/64 bit input.
+ |
+ | If *flags* & ``TCG_BSWAP_IZ``, then *t1* is known to be zero-extended from bit 15.
+ | If *flags* & ``TCG_BSWAP_OZ``, then *t0* will be zero-extended from bit 15.
+ | If *flags* & ``TCG_BSWAP_OS``, then *t0* will be sign-extended from bit 15.
+ |
+ | If neither ``TCG_BSWAP_OZ`` nor ``TCG_BSWAP_OS`` is set, then the bits of *t0* above bit 15 may contain any value.
+
+ * - bswap32_i64 *t0*, *t1*, *flags*
+
+ - | 32 bit byte swap on a 64-bit value. The flags are the same as for bswap16,
+ except they apply from bit 31 instead of bit 15.
+
+ * - bswap32_i32 *t0*, *t1*, *flags*
+
+ bswap64_i64 *t0*, *t1*, *flags*
+
+ - | 32/64 bit byte swap. The flags are ignored, but still present
+ for consistency with the other bswap opcodes.
+
+ * - discard_i32/i64 *t0*
+
+ - | Indicate that the value of *t0* won't be used later. It is useful to
+ force dead code elimination.
+
+ * - deposit_i32/i64 *dest*, *t1*, *t2*, *pos*, *len*
+
+ - | Deposit *t2* as a bitfield into *t1*, placing the result in *dest*.
+ |
+ | The bitfield is described by *pos*/*len*, which are immediate values:
+ |
+ | *len* - the length of the bitfield
+ | *pos* - the position of the first bit, counting from the LSB
+ |
+ | For example, "deposit_i32 dest, t1, t2, 8, 4" indicates a 4-bit field
+ at bit 8. This operation would be equivalent to
+ |
+ | *dest* = (*t1* & ~0x0f00) | ((*t2* << 8) & 0x0f00)
+
+ * - extract_i32/i64 *dest*, *t1*, *pos*, *len*
+
+ sextract_i32/i64 *dest*, *t1*, *pos*, *len*
+
+ - | Extract a bitfield from *t1*, placing the result in *dest*.
+ |
+ | The bitfield is described by *pos*/*len*, which are immediate values,
+ as above for deposit. For extract_*, the result will be extended
+ to the left with zeros; for sextract_*, the result will be extended
+ to the left with copies of the bitfield sign bit at *pos* + *len* - 1.
+ |
+ | For example, "sextract_i32 dest, t1, 8, 4" indicates a 4-bit field
+ at bit 8. This operation would be equivalent to
+ |
+ | *dest* = (*t1* << 20) >> 28
+ |
+ | (using an arithmetic right shift).
+
+ * - extract2_i32/i64 *dest*, *t1*, *t2*, *pos*
+
+ - | For N = {32,64}, extract an N-bit quantity from the concatenation
+ of *t2*:*t1*, beginning at *pos*. The tcg_gen_extract2_{i32,i64} expander
+ accepts 0 <= *pos* <= N as inputs. The backend code generator will
+ not see either 0 or N as inputs for these opcodes.
+
+ * - extrl_i64_i32 *t0*, *t1*
+
+ - | For 64-bit hosts only, extract the low 32-bits of input *t1* and place it
+ into 32-bit output *t0*. Depending on the host, this may be a simple move,
+ or may require additional canonicalization.
+
+ * - extrh_i64_i32 *t0*, *t1*
+
+ - | For 64-bit hosts only, extract the high 32-bits of input *t1* and place it
+ into 32-bit output *t0*. Depending on the host, this may be a simple shift,
+ or may require additional canonicalization.
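+
+The deposit, extract and sextract descriptions above can be modelled in
+a few lines of C. This is only a sketch of the 32-bit semantics under
+the assumption 1 <= *len* and *pos* + *len* <= 32; the function names
+are hypothetical and are not QEMU APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with the low len bits set; valid for 1 <= len <= 32. */
uint32_t sketch_mask32(unsigned len)
{
    return UINT32_MAX >> (32 - len);
}

/* Hypothetical model of deposit_i32 dest, t1, t2, pos, len. */
uint32_t sketch_deposit_i32(uint32_t t1, uint32_t t2,
                            unsigned pos, unsigned len)
{
    assert(len >= 1 && pos + len <= 32);
    uint32_t mask = sketch_mask32(len) << pos;
    return (t1 & ~mask) | ((t2 << pos) & mask);
}

/* Hypothetical model of extract_i32: zero-extended bitfield. */
uint32_t sketch_extract_i32(uint32_t t1, unsigned pos, unsigned len)
{
    assert(len >= 1 && pos + len <= 32);
    return (t1 >> pos) & sketch_mask32(len);
}

/* Hypothetical model of sextract_i32: returns the sign-extended bit pattern. */
uint32_t sketch_sextract_i32(uint32_t t1, unsigned pos, unsigned len)
{
    uint32_t field = sketch_extract_i32(t1, pos, len);
    uint32_t sign = 1u << (len - 1);
    /* (field ^ sign) - sign sign-extends without any signed shift. */
    return (field ^ sign) - sign;
}
```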
+
+
+Conditional moves
+-----------------
+
+.. list-table::
+
+ * - setcond_i32/i64 *dest*, *t1*, *t2*, *cond*
+
+ - | *dest* = (*t1* *cond* *t2*)
+ |
+ | Set *dest* to 1 if (*t1* *cond* *t2*) is true, otherwise set to 0.
+
+ * - negsetcond_i32/i64 *dest*, *t1*, *t2*, *cond*
+
+ - | *dest* = -(*t1* *cond* *t2*)
+ |
+ | Set *dest* to -1 if (*t1* *cond* *t2*) is true, otherwise set to 0.
+
+ * - movcond_i32/i64 *dest*, *c1*, *c2*, *v1*, *v2*, *cond*
+
+ - | *dest* = (*c1* *cond* *c2* ? *v1* : *v2*)
+ |
+ | Set *dest* to *v1* if (*c1* *cond* *c2*) is true, otherwise set to *v2*.
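+
+The three ops above differ only in what they do with the outcome of
+the comparison. A minimal C sketch for the 32-bit case follows; the
+enum covers only a subset of the conditions, and every name in it is
+hypothetical rather than taken from the TCG sources.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the TCG conditions, for illustration only. */
enum sketch_cond {
    SKETCH_EQ, SKETCH_NE, SKETCH_LT, SKETCH_LTU, SKETCH_TSTEQ, SKETCH_TSTNE
};

bool sketch_eval(uint32_t a, uint32_t b, enum sketch_cond cond)
{
    switch (cond) {
    case SKETCH_EQ:    return a == b;
    case SKETCH_NE:    return a != b;
    /* Flipping the sign bit turns a signed compare into an unsigned one. */
    case SKETCH_LT:    return (a ^ 0x80000000u) < (b ^ 0x80000000u);
    case SKETCH_LTU:   return a < b;
    case SKETCH_TSTEQ: return (a & b) == 0;
    case SKETCH_TSTNE: return (a & b) != 0;
    }
    return false;
}

/* setcond: dest = 1 if the condition holds, else 0. */
uint32_t sketch_setcond_i32(uint32_t t1, uint32_t t2, enum sketch_cond cond)
{
    return sketch_eval(t1, t2, cond) ? 1u : 0u;
}

/* negsetcond: dest = -1 (all bits set) if the condition holds, else 0. */
uint32_t sketch_negsetcond_i32(uint32_t t1, uint32_t t2, enum sketch_cond cond)
{
    return sketch_eval(t1, t2, cond) ? UINT32_MAX : 0u;
}

/* movcond: dest = v1 if (c1 cond c2) holds, else v2. */
uint32_t sketch_movcond_i32(uint32_t c1, uint32_t c2,
                            uint32_t v1, uint32_t v2, enum sketch_cond cond)
{
    return sketch_eval(c1, c2, cond) ? v1 : v2;
}
```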
+
+
+Type conversions
+----------------
+
+.. list-table::
+
+ * - ext_i32_i64 *t0*, *t1*
+
+ - | Convert *t1* (32 bit) to *t0* (64 bit) with sign extension
+
+ * - extu_i32_i64 *t0*, *t1*
+
+ - | Convert *t1* (32 bit) to *t0* (64 bit) with zero extension
+
+ * - trunc_i64_i32 *t0*, *t1*
+
+ - | Truncate *t1* (64 bit) to *t0* (32 bit)
+
+ * - concat_i32_i64 *t0*, *t1*, *t2*
+
+ - | Construct *t0* (64-bit) taking the low half from *t1* (32 bit) and the high half
+ from *t2* (32 bit).
+
+ * - concat32_i64 *t0*, *t1*, *t2*
+
+ - | Construct *t0* (64-bit) taking the low half from *t1* (64 bit) and the high half
+ from *t2* (64 bit).
+
+
+Load/Store
+----------
+
+.. list-table::
+
+ * - ld_i32/i64 *t0*, *t1*, *offset*
+
+ ld8s_i32/i64 *t0*, *t1*, *offset*
+
+ ld8u_i32/i64 *t0*, *t1*, *offset*
+
+ ld16s_i32/i64 *t0*, *t1*, *offset*
+
+ ld16u_i32/i64 *t0*, *t1*, *offset*
+
+ ld32s_i64 *t0*, *t1*, *offset*
+
+ ld32u_i64 *t0*, *t1*, *offset*
+
+ - | *t0* = read(*t1* + *offset*)
+ |
+ | Load 8, 16, 32 or 64 bits with or without sign extension from host memory.
+ *offset* must be a constant.
+
+ * - st_i32/i64 *t0*, *t1*, *offset*
+
+ st8_i32/i64 *t0*, *t1*, *offset*
+
+ st16_i32/i64 *t0*, *t1*, *offset*
+
+ st32_i64 *t0*, *t1*, *offset*
+
+ - | write(*t0*, *t1* + *offset*)
+ |
+ | Write 8, 16, 32 or 64 bits to host memory.
+
+All these opcodes assume that the host memory pointed to does not
+correspond to a global. If it does, the behaviour is unpredictable.
+
+
+Multiword arithmetic support
+----------------------------
+
+.. list-table::
+
+ * - add2_i32/i64 *t0_low*, *t0_high*, *t1_low*, *t1_high*, *t2_low*, *t2_high*
+
+ sub2_i32/i64 *t0_low*, *t0_high*, *t1_low*, *t1_high*, *t2_low*, *t2_high*
+
+ - | Similar to add/sub, except that the double-word inputs *t1* and *t2* are
+ formed from two single-word arguments, and the double-word output *t0*
+ is returned in two single-word outputs.
+
+ * - mulu2_i32/i64 *t0_low*, *t0_high*, *t1*, *t2*
+
+ - | Similar to mul, except that the two unsigned inputs *t1* and *t2* yield the full
+ double-word product *t0*, which is returned in two single-word outputs.
+
+ * - muls2_i32/i64 *t0_low*, *t0_high*, *t1*, *t2*
+
+ - | Similar to mulu2, except the two inputs *t1* and *t2* are signed.
+
+ * - mulsh_i32/i64 *t0*, *t1*, *t2*
+
+ muluh_i32/i64 *t0*, *t1*, *t2*
+
+ - | Provide the high part of a signed or unsigned multiply, respectively.
+ |
+ | If mulu2/muls2 are not provided by the backend, the tcg-op generator
+ can obtain the same results by emitting a pair of opcodes, mul + muluh/mulsh.
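+
+To make the double-word conventions concrete, the following C sketch
+models add2_i32 and shows mulu2_i32 built from mul plus muluh, as
+described in the last entry. All function names are hypothetical, and
+the sketch assumes a 32-bit ``unsigned int`` so that ``uint32_t``
+arithmetic wraps modulo 2^32.

```c
#include <stdint.h>

/* Hypothetical model of add2_i32: (hi:lo) = (t1_high:t1_low) + (t2_high:t2_low). */
void sketch_add2_i32(uint32_t *t0_low, uint32_t *t0_high,
                     uint32_t t1_low, uint32_t t1_high,
                     uint32_t t2_low, uint32_t t2_high)
{
    uint32_t low = t1_low + t2_low;
    uint32_t carry = low < t1_low;      /* wrapped, so carry into the high word */

    *t0_low = low;
    *t0_high = t1_high + t2_high + carry;
}

/* Hypothetical model of muluh_i32: high 32 bits of the unsigned product. */
uint32_t sketch_muluh_i32(uint32_t t1, uint32_t t2)
{
    return (uint32_t)(((uint64_t)t1 * t2) >> 32);
}

/* mulu2_i32 expressed as the pair mul + muluh. */
void sketch_mulu2_i32(uint32_t *t0_low, uint32_t *t0_high,
                      uint32_t t1, uint32_t t2)
{
    *t0_low = t1 * t2;                      /* mul: low half of the product */
    *t0_high = sketch_muluh_i32(t1, t2);    /* muluh: high half */
}
```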
+
+
+Memory Barrier support
+----------------------
+
+.. list-table::
+
+ * - mb *<$arg>*
+
+ - | Generate a target memory barrier instruction to enforce the memory ordering
+ required by the corresponding guest memory barrier instruction.
+ |
+ | The ordering enforced by the backend may be stricter than the ordering
+ required by the guest. It cannot be weaker. This opcode takes a constant
+ argument which is required to generate the appropriate barrier
+ instruction. The backend should take care to emit the target barrier
+ instruction only when necessary, i.e. for SMP guests and when MTTCG is
+ enabled.
+ |
+ | The guest translators should generate this opcode for all guest instructions
+ which have ordering side effects.
+ |
+ | Please see :ref:`atomics-ref` for more information on memory barriers.
+
+
+64-bit guest on 32-bit host support
+-----------------------------------
+
+The following opcodes are internal to TCG. Thus they are to be implemented by
+32-bit host code generators, but are not to be emitted by guest translators.
+They are emitted as needed by inline functions within ``tcg-op.h``.
+
+.. list-table::
+
+ * - brcond2_i32 *t0_low*, *t0_high*, *t1_low*, *t1_high*, *cond*, *label*
+
+ - | Similar to brcond, except that the 64-bit values *t0* and *t1*
+ are formed from two 32-bit arguments.
+
+ * - setcond2_i32 *dest*, *t1_low*, *t1_high*, *t2_low*, *t2_high*, *cond*
+
+ - | Similar to setcond, except that the 64-bit values *t1* and *t2* are
+ formed from two 32-bit arguments. The result is a 32-bit value.
+
+
+QEMU specific operations
+------------------------
+
+.. list-table::
+
+ * - exit_tb *t0*
+
+ - | Exit the current TB and return the value *t0* (word type).
+
+ * - goto_tb *index*
+
+ - | Exit the current TB and jump to the TB index *index* (constant) if the
+ current TB was linked to this TB. Otherwise execute the next
+ instructions. Only indices 0 and 1 are valid and tcg_gen_goto_tb may be issued
+ at most once with each slot index per TB.
+
+ * - lookup_and_goto_ptr *tb_addr*
+
+ - | Look up a TB address *tb_addr* and jump to it if valid. If not valid,
+ jump to the TCG epilogue to go back to the exec loop.
+ |
+ | This operation is optional. If the TCG backend does not implement the
+ goto_ptr opcode, emitting this op is equivalent to emitting exit_tb(0).
+
+ * - qemu_ld_i32/i64/i128 *t0*, *t1*, *flags*, *memidx*
+
+ qemu_st_i32/i64/i128 *t0*, *t1*, *flags*, *memidx*
+
+ qemu_st8_i32 *t0*, *t1*, *flags*, *memidx*
+
+ - | Load data at the guest address *t1* into *t0*, or store data in *t0* at guest
+ address *t1*. The _i32/_i64/_i128 size applies to the size of the input/output
+ register *t0* only. The address *t1* is always sized according to the guest,
+ and the width of the memory operation is controlled by *flags*.
+ |
+ | Both *t0* and *t1* may be split into little-endian ordered pairs of registers
+ if dealing with 64-bit quantities on a 32-bit host, or 128-bit quantities on
+ a 64-bit host.
+ |
+ | The *memidx* selects the qemu tlb index to use (e.g. user or kernel access).
+ The flags are the MemOp bits, selecting the sign, width, and endianness
+ of the memory access.
+ |
+ | For a 32-bit host, qemu_ld/st_i64 is guaranteed to only be used with a
+ 64-bit memory access specified in *flags*.
+ |
+ | qemu_ld/st_i128 are only supported on a 64-bit host.
+ |
+ | For i386, qemu_st8_i32 is exactly like qemu_st_i32, except the size of
+ the memory operation is known to be 8-bit. This allows the backend to
+ provide a different set of register constraints.
+
+
+Host vector operations
+----------------------
+
+All of the vector ops have two parameters, ``TCGOP_VECL`` & ``TCGOP_VECE``.
+The former specifies the length of the vector in log2 64-bit units; the
+latter specifies the length of the element (if applicable) in log2 8-bit units.
+E.g. VECL = 1 -> 64 << 1 -> v128, and VECE = 2 -> 1 << 2 -> i32.
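+
+The two log2 encodings can be decoded as in the following C sketch,
+which reads the vector length as 64 << VECL bits and the element size
+as 8 << VECE bits (equivalently 1 << VECE bytes). The function names
+are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Vector length in bits: VECL counts log2 units of 64 bits. */
unsigned sketch_vec_bits(unsigned vecl)
{
    return 64u << vecl;
}

/* Element size in bits: VECE counts log2 units of 8 bits. */
unsigned sketch_elem_bits(unsigned vece)
{
    return 8u << vece;
}

/* Number of elements held by one vector. */
unsigned sketch_elem_count(unsigned vecl, unsigned vece)
{
    assert(sketch_elem_bits(vece) <= sketch_vec_bits(vecl));
    return sketch_vec_bits(vecl) / sketch_elem_bits(vece);
}
```

+With VECL = 1 and VECE = 2 this gives a 128-bit vector of four 32-bit
+elements, matching the example above.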
+
+.. list-table::
+
+ * - mov_vec *v0*, *v1*
+
+ ld_vec *v0*, *t1*
+
+ st_vec *v0*, *t1*
+
+ - | Move, load and store.
+
+ * - dup_vec *v0*, *r1*
+
+ - | Duplicate the low N bits of *r1* into VECL/VECE copies across *v0*.
+
+ * - dupi_vec *v0*, *c*
+
+ - | Similarly, for a constant.
+ | Smaller values will be replicated to host register size by the expanders.
+
+ * - dup2_vec *v0*, *r1*, *r2*
+
+ - | Duplicate *r2*:*r1* into VECL/64 copies across *v0*. This opcode is
+ only present for 32-bit hosts.
+
+ * - add_vec *v0*, *v1*, *v2*
+
+ - | *v0* = *v1* + *v2*, in elements across the vector.
+
+ * - sub_vec *v0*, *v1*, *v2*
+
+ - | Similarly, *v0* = *v1* - *v2*.
+
+ * - mul_vec *v0*, *v1*, *v2*
+
+ - | Similarly, *v0* = *v1* * *v2*.
+
+ * - neg_vec *v0*, *v1*
+
+ - | Similarly, *v0* = -*v1*.
+
+ * - abs_vec *v0*, *v1*
+
+ - | Similarly, *v0* = *v1* < 0 ? -*v1* : *v1*, in elements across the vector.
+
+ * - smin_vec *v0*, *v1*, *v2*
+
+ umin_vec *v0*, *v1*, *v2*
+
+ - | Similarly, *v0* = MIN(*v1*, *v2*), for signed and unsigned element types.
+
+ * - smax_vec *v0*, *v1*, *v2*
+
+ umax_vec *v0*, *v1*, *v2*
+
+ - | Similarly, *v0* = MAX(*v1*, *v2*), for signed and unsigned element types.
+
+ * - ssadd_vec *v0*, *v1*, *v2*
+
+ sssub_vec *v0*, *v1*, *v2*
+
+ usadd_vec *v0*, *v1*, *v2*
+
+ ussub_vec *v0*, *v1*, *v2*
+
+ - | Signed and unsigned saturating addition and subtraction.
+ |
+ | If the true result is not representable within the element type, the
+ element is set to the minimum or maximum value for the type.
+
+ * - and_vec *v0*, *v1*, *v2*
+
+ or_vec *v0*, *v1*, *v2*
+
+ xor_vec *v0*, *v1*, *v2*
+
+ andc_vec *v0*, *v1*, *v2*
+
+ orc_vec *v0*, *v1*, *v2*
+
+ not_vec *v0*, *v1*
+
+ - | Similarly, logical operations with and without complement.
+ |
+ | Note that VECE is unused.
+
+ * - shli_vec *v0*, *v1*, *i2*
+
+ shls_vec *v0*, *v1*, *s2*
+
+ - | Shift all elements from *v1* by a scalar *i2*/*s2*. I.e.
+
+ .. code-block:: c
+
+ for (i = 0; i < VECL/VECE; ++i) {
+ v0[i] = v1[i] << s2;
+ }
+
+ * - shri_vec *v0*, *v1*, *i2*
+
+ sari_vec *v0*, *v1*, *i2*
+
+ rotli_vec *v0*, *v1*, *i2*
+
+ shrs_vec *v0*, *v1*, *s2*
+
+ sars_vec *v0*, *v1*, *s2*
+
+ - | Similarly for logical and arithmetic right shift, and left rotate.
+
+ * - shlv_vec *v0*, *v1*, *v2*
+
+ - | Shift elements from *v1* by elements from *v2*. I.e.
+
+ .. code-block:: c
+
+ for (i = 0; i < VECL/VECE; ++i) {
+ v0[i] = v1[i] << v2[i];
+ }
+
+ * - shrv_vec *v0*, *v1*, *v2*
+
+ sarv_vec *v0*, *v1*, *v2*
+
+ rotlv_vec *v0*, *v1*, *v2*
+
+ rotrv_vec *v0*, *v1*, *v2*
+
+ - | Similarly for logical and arithmetic right shift, and rotates.
+
+ * - cmp_vec *v0*, *v1*, *v2*, *cond*
+
+ - | Compare vectors by element, storing -1 for true and 0 for false.
+
+ * - bitsel_vec *v0*, *v1*, *v2*, *v3*
+
+ - | Bitwise select, *v0* = (*v2* & *v1*) | (*v3* & ~\ *v1*), across the entire vector.
+
+ * - cmpsel_vec *v0*, *c1*, *c2*, *v3*, *v4*, *cond*
+
+ - | Select elements based on comparison results:
+
+ .. code-block:: c
+
+ for (i = 0; i < n; ++i) {
+ v0[i] = (c1[i] cond c2[i]) ? v3[i] : v4[i];
+ }
+
+**Note 1**: Some shortcuts are defined when the last operand is known to be
+a constant (e.g. addi for add, movi for mov).
+
+**Note 2**: When using TCG, the opcodes must never be generated directly
+as some of them may not be available as "real" opcodes. Always use the
+function tcg_gen_xxx(args).
+
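As a scalar reference for a few of the vector ops listed above, the per-element behaviour can be sketched in plain C. The function names are hypothetical and model a single element (or a single 64-bit chunk for the bitwise op); they are not the TCG expanders and say nothing about how a backend implements the ops.

```c
#include <assert.h>
#include <stdint.h>

/* Signed saturating add of one 8-bit element (models ssadd_vec, VECE = 0). */
static int8_t ssadd8(int8_t a, int8_t b)
{
    int sum = (int)a + (int)b;          /* widen so the true result fits */
    if (sum > INT8_MAX) {
        return INT8_MAX;
    }
    if (sum < INT8_MIN) {
        return INT8_MIN;
    }
    return (int8_t)sum;
}

/* Unsigned saturating subtract of one 8-bit element (models ussub_vec). */
static uint8_t ussub8(uint8_t a, uint8_t b)
{
    return a > b ? (uint8_t)(a - b) : 0;
}

/* Bitwise select on one 64-bit chunk: v1 is the mask (models bitsel_vec). */
static uint64_t bitsel64(uint64_t v1, uint64_t v2, uint64_t v3)
{
    return (v2 & v1) | (v3 & ~v1);
}

/* Signed less-than on one 32-bit element: -1 for true, 0 for false. */
static int32_t cmp_lt32(int32_t a, int32_t b)
{
    assert(sizeof(int32_t) == 4);
    return a < b ? -1 : 0;
}
```

The all-ones result of the comparison is what makes it usable directly as the mask operand of the bitwise select.
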
+
+Backend
+=======
+
+``tcg-target.h`` contains the target specific definitions. ``tcg-target.c.inc``
+contains the target specific code; it is #included by ``tcg/tcg.c``, rather
+than being a standalone C file.
+
+Assumptions
+-----------
+
+The target word size (``TCG_TARGET_REG_BITS``) is expected to be 32 bit or
+64 bit. It is expected that the pointer has the same size as the word.
+
+On a 32 bit target, all 64 bit operations are converted to 32 bits. A
+few specific operations must be implemented to allow it (see add2_i32,
+sub2_i32, brcond2_i32).
+
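The double-word ops mentioned above amount to ordinary carry propagation between two 32-bit halves. The following is a minimal sketch of a 64-bit addition built from 32-bit pieces; ``add2_u32`` is a hypothetical helper and its operand order (low half first, then high half) is an assumption of this example, not a statement about the opcode's definition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: (rh:rl) = (ah:al) + (bh:bl) using 32-bit arithmetic. */
static void add2_u32(uint32_t *rl, uint32_t *rh,
                     uint32_t al, uint32_t ah,
                     uint32_t bl, uint32_t bh)
{
    uint32_t lo, carry;

    assert(rl && rh);
    lo = al + bl;               /* wraps modulo 2^32 */
    carry = lo < al;            /* 1 if the low-half add overflowed */
    *rl = lo;
    *rh = ah + bh + carry;
}
```
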
+On a 64 bit target, the values are transferred between 32 and 64-bit
+registers using the following ops:
+
+- extrl_i64_i32
+- extrh_i64_i32
+- ext_i32_i64
+- extu_i32_i64
+
+They ensure that the values are correctly truncated or extended when
+moved from a 32-bit to a 64-bit register or vice-versa. Note that the
+extrl_i64_i32 and extrh_i64_i32 are optional ops. It is not necessary
+to implement them if all the following conditions are met:
+
+- 64-bit registers can hold 32-bit values
+- 32-bit values in a 64-bit register do not need to stay zero or
+ sign extended
+- all 32-bit TCG ops ignore the high part of 64-bit registers
+
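The effect of the four ops can be modelled in plain C as follows. The functions are named after the ops purely for readability; they are illustrative stand-ins, not QEMU functions, and ``ext_selfcheck`` is a hypothetical helper added for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32 bits of a 64-bit value. */
static uint32_t extrl_i64_i32(uint64_t x)
{
    return (uint32_t)x;
}

/* High 32 bits of a 64-bit value. */
static uint32_t extrh_i64_i32(uint64_t x)
{
    return (uint32_t)(x >> 32);
}

/* Sign-extend a 32-bit value to 64 bits. */
static uint64_t ext_i32_i64(uint32_t x)
{
    return (x & 0x80000000u) ? ((uint64_t)x | 0xFFFFFFFF00000000ull)
                             : (uint64_t)x;
}

/* Zero-extend a 32-bit value to 64 bits. */
static uint64_t extu_i32_i64(uint32_t x)
{
    return (uint64_t)x;
}

/* Extending and then truncating must give back the original value. */
static void ext_selfcheck(uint32_t x)
{
    assert(extrl_i64_i32(ext_i32_i64(x)) == x);
    assert(extrl_i64_i32(extu_i32_i64(x)) == x);
}
```
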
+Floating point operations are not supported in this version. A
+previous incarnation of the code generator had full support of them,
+but it is better to concentrate on integer operations first.
+
+Constraints
+----------------
+
+GCC-like constraints are used to define the constraints of every
+instruction. Memory constraints are not supported in this
+version. Aliases are specified in the input operands as for GCC.
+
+The same register may be used for both an input and an output, even when
+they are not explicitly aliased. If an op expands to multiple target
+instructions then care must be taken to avoid clobbering input values.
+GCC-style "early clobber" outputs are supported, with '``&``'.
+
+A target can define specific register or constant constraints. If an
+operation uses a constant input constraint which does not allow all
+constants, it must also accept registers in order to have a fallback.
+The constraint '``i``' is defined generically to accept any constant.
+The constraint '``r``' is not defined generically, but is consistently
+used by each backend to indicate all registers.
+
+The movi_i32 and movi_i64 operations must accept any constants.
+
+The mov_i32 and mov_i64 operations must accept any registers of the
+same type.
+
+The ld/st/sti instructions must accept signed 32 bit constant offsets.
+This can be implemented by reserving a specific register in which to
+compute the address if the offset is too big.
+
+The ld/st instructions must accept any destination (ld) or source (st)
+register.
+
+The sti instruction may fail if it cannot store the given constant.
+
+Function call assumptions
+-------------------------
+
+- The only supported types for parameters and return value are: 32 and
+ 64 bit integers and pointer.
+- The stack grows downwards.
+- The first N parameters are passed in registers.
+- The next parameters are passed on the stack by storing them as words.
+- Some registers are clobbered during the call.
+- The function can return 0 or 1 value in registers. On a 32 bit
+ target, functions must be able to return 2 values in registers for
+ 64 bit return type.
+
+
+Recommended coding rules for best performance
+=============================================
+
+- Use globals to represent the parts of the QEMU CPU state which are
+ often modified, e.g. the integer registers and the condition
+ codes. TCG will be able to use host registers to store them.
+
+- Don't hesitate to use helpers for complicated or seldom used guest
+ instructions. There is little performance advantage in using TCG to
+ implement guest instructions taking more than about twenty TCG
+ instructions. Note that this rule of thumb is more applicable to
+ helpers doing complex logic or arithmetic, where the C compiler has
+ scope to do a good job of optimisation; it is less relevant where
+ the instruction is mostly doing loads and stores, and in those cases
+ inline TCG may still be faster for longer sequences.
+
+- Use the 'discard' instruction if you know that TCG won't be able to
+ prove that a given global is "dead" at a given program point. The
+ x86 guest uses it to improve the condition codes optimisation.
diff --git a/docs/devel/tcg-plugins.rst b/docs/devel/tcg-plugins.rst
index dac5101a3c..9cc09d8c3d 100644
--- a/docs/devel/tcg-plugins.rst
+++ b/docs/devel/tcg-plugins.rst
@@ -3,7 +3,8 @@
Copyright (c) 2019, Linaro Limited
Written by Emilio Cota and Alex Bennée
-================
+.. _TCG Plugins:
+
QEMU TCG Plugins
================
@@ -16,8 +17,35 @@ only monitor it passively. However they can do this down to an
individual instruction granularity including potentially subscribing
to all load and store operations.
-API Stability
-=============
+Usage
+-----
+
+Any QEMU binary with TCG support has plugins enabled by default.
+In earlier releases plugins needed to be explicitly enabled with::
+
+ configure --enable-plugins
+
+Once built, a program can be run with multiple plugins loaded, each with
+its own arguments::
+
+ $QEMU $OTHER_QEMU_ARGS \
+ -plugin contrib/plugins/libhowvec.so,inline=on,count=hint \
+ -plugin contrib/plugins/libhotblocks.so
+
+Arguments are plugin specific and can be used to modify their
+behaviour. In this case the howvec plugin is being asked to use inline
+ops to count and break down the hint instructions by type.
+
+Linux user-mode emulation also evaluates the environment variable
+``QEMU_PLUGIN``::
+
+ QEMU_PLUGIN="file=contrib/plugins/libhowvec.so,inline=on,count=hint" $QEMU
+
+Writing plugins
+---------------
+
+API versioning
+~~~~~~~~~~~~~~
This is a new feature for QEMU and it does allow people to develop
out-of-tree plugins that can be dynamically linked into a running QEMU
@@ -25,9 +53,6 @@ process. However the project reserves the right to change or break the
API should it need to do so. The best way to avoid this is to submit
your plugin upstream so they can be updated if/when the API changes.
-API versioning
---------------
-
All plugins need to declare a symbol which exports the plugin API
version they were built against. This can be done simply by::
@@ -43,18 +68,8 @@ current API versions supported by QEMU. The API version will be
incremented if new APIs are added. The minimum API version will be
incremented if existing APIs are changed or removed.
-Exposure of QEMU internals
---------------------------
-
-The plugin architecture actively avoids leaking implementation details
-about how QEMU's translation works to the plugins. While there are
-conceptions such as translation time and translation blocks the
-details are opaque to plugins. The plugin is able to query select
-details of instructions and system configuration only through the
-exported *qemu_plugin* functions.
-
-Query Handle Lifetime
----------------------
+Lifetime of the query handle
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each callback provides an opaque anonymous information handle which
can usually be further queried to find out information about a
@@ -63,32 +78,8 @@ valid during the lifetime of the callback so it is important that any
information that is needed is extracted during the callback and saved
by the plugin.
-API
-===
-
-.. kernel-doc:: include/qemu/qemu-plugin.h
-
-Usage
-=====
-
-Any QEMU binary with TCG support has plugins enabled by default.
-Earlier releases needed to be explicitly enabled with::
-
- configure --enable-plugins
-
-Once built a program can be run with multiple plugins loaded each with
-their own arguments::
-
- $QEMU $OTHER_QEMU_ARGS \
- -plugin tests/plugin/libhowvec.so,inline=on,count=hint \
- -plugin tests/plugin/libhotblocks.so
-
-Arguments are plugin specific and can be used to modify their
-behaviour. In this case the howvec plugin is being asked to use inline
-ops to count and break down the hint instructions by type.
-
-Plugin Life cycle
-=================
+Plugin life cycle
+~~~~~~~~~~~~~~~~~
First the plugin is loaded and the public qemu_plugin_install function
is called. The plugin will then register callbacks for various plugin
@@ -111,11 +102,70 @@ callback which can then ensure atomicity itself.
Finally when QEMU exits all the registered *atexit* callbacks are
invoked.
+Exposure of QEMU internals
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The plugin architecture actively avoids leaking implementation details
+about how QEMU's translation works to the plugins. While there are
+conceptions such as translation time and translation blocks the
+details are opaque to plugins. The plugin is able to query select
+details of instructions and system configuration only through the
+exported *qemu_plugin* functions.
+
+However the following assumptions can be made:
+
+Translation Blocks
+++++++++++++++++++
+
+All code will go through a translation phase although not all
+translations will necessarily be executed. You need to instrument
+actual executions to track what is happening.
+
+It is quite normal to see the same address translated multiple times.
+If you want to track the code in system emulation you should examine
+the underlying physical address (``qemu_plugin_insn_haddr``) to take
+into account the effects of virtual memory although if the system does
+paging this will change too.
+
+Not all instructions in a block will always execute so if it's
+important to track individual instruction execution you need to
+instrument them directly. However asynchronous interrupts will not
+change control flow mid-block.
+
+Instructions
+++++++++++++
+
+Instruction instrumentation runs before the instruction executes. You
+can be sure the instruction will be dispatched, but you can't
+be sure it will complete. Generally this will be because of a
+synchronous exception (e.g. SIGILL) triggered by the instruction
+attempting to execute. If you want to be sure you will need to
+instrument the next instruction as well. See the ``execlog.c`` plugin
+for examples of how to track this and finalise details after execution.
+
+Memory Accesses
++++++++++++++++
+
+Memory callbacks are called after a successful load or store.
+Unsuccessful operations (i.e. faults) will not be visible to memory
+instrumentation although the execution side effects can be observed
+(e.g. entering an exception handler).
+
+System Idle and Resume States
++++++++++++++++++++++++++++++
+
+The ``qemu_plugin_register_vcpu_idle_cb`` and
+``qemu_plugin_register_vcpu_resume_cb`` functions can be used to track
+when CPUs go into and return from sleep states when waiting for
+external I/O. Be aware though that these may occur less frequently
+than in real HW due to the inefficiencies of emulation giving less
+chance for the CPU to idle.
+
Internals
-=========
+---------
Locking
--------
+~~~~~~~
We have to ensure we cannot deadlock, particularly under MTTCG. For
this we acquire a lock when called from plugin code. We also keep the
@@ -146,12 +196,141 @@ Example Plugins
There are a number of plugins included with QEMU and you are
encouraged to contribute your own plugins plugins upstream. There is a
-``contrib/plugins`` directory where they can go.
+``contrib/plugins`` directory where they can go. There are also some
+basic plugins that are used to test and exercise the API during the
+``make check-tcg`` target in ``tests/plugins``.
+
+- tests/plugins/empty.c
+
+Purely a test plugin for measuring the overhead of the plugins system
+itself. Does no instrumentation.
+
+- tests/plugins/bb.c
+
+A very basic plugin which will measure execution in coarse terms as
+each basic block is executed. By default the results are shown once
+execution finishes::
+
+ $ qemu-aarch64 -plugin tests/plugin/libbb.so \
+ -d plugin ./tests/tcg/aarch64-linux-user/sha1
+ SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+ bb's: 2277338, insns: 158483046
+
+Behaviour can be tweaked with the following arguments:
+
+ * inline=true|false
+
+ Use faster inline addition of a single counter. Not per-cpu and not
+ thread safe.
+
+ * idle=true|false
+
+ Dump the current execution stats whenever the guest vCPU idles
+
+- tests/plugins/insn.c
+
+This is a basic instruction level instrumentation which can count the
+number of instructions executed on each core/thread::
+
+ $ qemu-aarch64 -plugin tests/plugin/libinsn.so \
+ -d plugin ./tests/tcg/aarch64-linux-user/threadcount
+ Created 10 threads
+ Done
+ cpu 0 insns: 46765
+ cpu 1 insns: 3694
+ cpu 2 insns: 3694
+ cpu 3 insns: 2994
+ cpu 4 insns: 1497
+ cpu 5 insns: 1497
+ cpu 6 insns: 1497
+ cpu 7 insns: 1497
+ total insns: 63135
+
+Behaviour can be tweaked with the following arguments:
+
+ * inline=true|false
+
+ Use faster inline addition of a single counter. Not per-cpu and not
+ thread safe.
+
+ * sizes=true|false
+
+ Give a summary of the instruction sizes for the execution
-- tests/plugins
+ * match=<string>
-These are some basic plugins that are used to test and exercise the
-API during the ``make check-tcg`` target.
+ Only instrument instructions matching the string prefix. Will show
+ some basic stats including how many instructions have executed since
+ the last execution. For example::
+
+ $ qemu-aarch64 -plugin tests/plugin/libinsn.so,match=bl \
+ -d plugin ./tests/tcg/aarch64-linux-user/sha512-vector
+ ...
+ 0x40069c, 'bl #0x4002b0', 10 hits, 1093 match hits, Δ+1257 since last match, 98 avg insns/match
+ 0x4006ac, 'bl #0x403690', 10 hits, 1094 match hits, Δ+47 since last match, 98 avg insns/match
+ 0x4037fc, 'bl #0x4002b0', 18 hits, 1095 match hits, Δ+22 since last match, 98 avg insns/match
+ 0x400720, 'bl #0x403690', 10 hits, 1096 match hits, Δ+58 since last match, 98 avg insns/match
+ 0x4037fc, 'bl #0x4002b0', 19 hits, 1097 match hits, Δ+22 since last match, 98 avg insns/match
+ 0x400730, 'bl #0x403690', 10 hits, 1098 match hits, Δ+33 since last match, 98 avg insns/match
+ 0x4037ac, 'bl #0x4002b0', 12 hits, 1099 match hits, Δ+20 since last match, 98 avg insns/match
+ ...
+
+For more detailed execution tracing see the ``execlog`` plugin for
+other options.
+
+- tests/plugins/mem.c
+
+Basic instruction level memory instrumentation::
+
+ $ qemu-aarch64 -plugin tests/plugin/libmem.so,inline=true \
+ -d plugin ./tests/tcg/aarch64-linux-user/sha1
+ SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
+ inline mem accesses: 79525013
+
+Behaviour can be tweaked with the following arguments:
+
+ * inline=true|false
+
+ Use faster inline addition of a single counter. Not per-cpu and not
+ thread safe.
+
+ * callback=true|false
+
+ Use callbacks on each memory instrumentation.
+
+ * hwaddr=true|false
+
+ Count IO accesses (only for system emulation)
+
+- tests/plugins/syscall.c
+
+A basic syscall tracing plugin. This only works for user-mode. By
+default it will give a summary of syscall stats at the end of the
+run::
+
+ $ qemu-aarch64 -plugin tests/plugin/libsyscall.so \
+ -d plugin ./tests/tcg/aarch64-linux-user/threadcount
+ Created 10 threads
+ Done
+ syscall no. calls errors
+ 226 12 0
+ 99 11 11
+ 115 11 0
+ 222 11 0
+ 93 10 0
+ 220 10 0
+ 233 10 0
+ 215 8 0
+ 214 4 0
+ 134 2 0
+ 64 2 0
+ 96 1 0
+ 94 1 0
+ 80 1 0
+ 261 1 0
+ 78 1 0
+ 160 1 0
+ 135 1 0
- contrib/plugins/hotblocks.c
@@ -168,7 +347,7 @@ slightly faster (but not thread safe) counters.
Example::
- ./aarch64-linux-user/qemu-aarch64 \
+ $ qemu-aarch64 \
-plugin contrib/plugins/libhotblocks.so -d plugin \
./tests/tcg/aarch64-linux-user/sha1
SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
@@ -182,7 +361,7 @@ Example::
Similar to hotblocks but this time tracks memory accesses::
- ./aarch64-linux-user/qemu-aarch64 \
+ $ qemu-aarch64 \
-plugin contrib/plugins/libhotpages.so -d plugin \
./tests/tcg/aarch64-linux-user/sha1
SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6
@@ -212,11 +391,11 @@ The hotpages plugin can be configured using the following arguments:
This is an instruction classifier so can be used to count different
types of instructions. It has a number of options to refine which get
-counted. You can give a value to the `count` argument for a class of
+counted. You can give a value to the ``count`` argument for a class of
instructions to break it down fully, so for example to see all the system
registers accesses::
- ./aarch64-softmmu/qemu-system-aarch64 $(QEMU_ARGS) \
+ $ qemu-system-aarch64 $(QEMU_ARGS) \
-append "root=/dev/sda2 systemd.unit=benchmark.service" \
-smp 4 -plugin ./contrib/plugins/libhowvec.so,count=sreg -d plugin
@@ -284,10 +463,10 @@ for the plugin is a path for the socket the two instances will
communicate over::
- ./sparc-softmmu/qemu-system-sparc -monitor none -parallel none \
+ $ qemu-system-sparc -monitor none -parallel none \
-net none -M SS-20 -m 256 -kernel day11/zImage.elf \
-plugin ./contrib/plugins/liblockstep.so,sockpath=lockstep-sparc.sock \
- -d plugin,nochain
+ -d plugin,nochain
which will eventually report::
@@ -342,9 +521,9 @@ The execlog tool traces executed instructions with memory access. It can be used
for debugging and security analysis purposes.
Please be aware that this will generate a lot of output.
-The plugin takes no argument::
+By default the plugin needs no arguments::
- qemu-system-arm $(QEMU_ARGS) \
+ $ qemu-system-arm $(QEMU_ARGS) \
-plugin ./contrib/plugins/libexeclog.so -d plugin
which will output an execution trace following this structure::
@@ -360,12 +539,36 @@ which will output an execution trace following this structure::
0, 0xd34, 0xf9c8f000, "bl #0x10c8"
0, 0x10c8, 0xfff96c43, "ldr r3, [r0, #0x44]", load, 0x200000e4, RAM
+The output can be filtered to only track certain instructions or
+addresses using the ``ifilter`` or ``afilter`` options. You can stack the
+arguments if required::
+
+ $ qemu-system-arm $(QEMU_ARGS) \
+ -plugin ./contrib/plugins/libexeclog.so,ifilter=st1w,afilter=0x40001808 -d plugin
+
+This plugin can also dump registers when they change value. Specify the name of the
+registers with multiple ``reg`` options. You can also use glob style matching if you wish::
+
+ $ qemu-system-arm $(QEMU_ARGS) \
+ -plugin ./contrib/plugins/libexeclog.so,reg=\*_el2,reg=sp -d plugin
+
+Be aware that each additional register to check will slow down
+execution quite considerably. You can optimise the number of register
+checks done by using the ``rdisas`` option. This will only instrument
+instructions that mention the registers in question in disassembly.
+This is not foolproof as some instructions implicitly change
+registers. You can use the ``ifilter`` option to catch these cases::
+
+ $ qemu-system-arm $(QEMU_ARGS) \
+ -plugin ./contrib/plugins/libexeclog.so,ifilter=msr,ifilter=blr,reg=x30,reg=\*_el1,rdisas=on
+
- contrib/plugins/cache.c
-Cache modelling plugin that measures the performance of a given cache
-configuration when a given working set is run::
+Cache modelling plugin that measures the performance of a given L1 cache
+configuration, and optionally a unified L2 per-core cache, when a given working
+set is run::
- qemu-x86_64 -plugin ./contrib/plugins/libcache.so \
+ $ qemu-x86_64 -plugin ./contrib/plugins/libcache.so \
-d plugin -D cache.log ./tests/tcg/x86_64-linux-user/float_convs
will report the following::
@@ -421,3 +624,27 @@ The plugin has a number of arguments, all of them are optional:
Sets the number of cores for which we maintain separate icache and dcache.
(default: for linux-user, N = 1, for full system emulation: N = cores
available to guest)
+
+ * l2=on
+
+ Simulates a unified L2 cache (stores blocks for both instructions and data)
+ using the default L2 configuration (cache size = 2MB, associativity = 16-way,
+ block size = 64B).
+
+ * l2cachesize=N
+ * l2blksize=B
+ * l2assoc=A
+
+ L2 cache configuration arguments. They specify the cache size, block size, and
+ associativity of the L2 cache, respectively. Setting any of the L2
+ configuration arguments implies ``l2=on``.
+ (default: N = 2097152 (2MB), B = 64, A = 16)
+
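The three L2 arguments together fix the geometry of a set-associative cache. As a rough sketch of that arithmetic (``cache_num_sets`` is a hypothetical helper written for this example, not a function taken from the plugin):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of sets in a set-associative cache. */
static uint64_t cache_num_sets(uint64_t cachesize, uint64_t blksize,
                               uint64_t assoc)
{
    assert(blksize > 0 && assoc > 0);
    /* This sketch assumes the size is an exact multiple of one set. */
    assert(cachesize % (blksize * assoc) == 0);
    return cachesize / (blksize * assoc);
}
```

With the defaults quoted above (N = 2097152, B = 64, A = 16) each set holds 16 blocks of 64 bytes, so the cache is divided into 2048 sets.
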
+Plugin API
+==========
+
+The following API is generated from the inline documentation in
+``include/qemu/qemu-plugin.h``. Please ensure any updates to the API
+include the full kernel-doc annotations.
+
+.. kernel-doc:: include/qemu/qemu-plugin.h
diff --git a/docs/devel/tcg.rst b/docs/devel/tcg.rst
index a65fb7b1c4..2786f2f679 100644
--- a/docs/devel/tcg.rst
+++ b/docs/devel/tcg.rst
@@ -1,3 +1,5 @@
+.. _tcg_internals:
+
====================
Translator Internals
====================
@@ -9,7 +11,7 @@ which make it relatively easily portable and simple while achieving good
performances.
QEMU's dynamic translation backend is called TCG, for "Tiny Code
-Generator". For more information, please take a look at ``tcg/README``.
+Generator". For more information, please take a look at :ref:`tcg-ops-ref`.
The following sections outline some notable features and implementation
details of QEMU's dynamic translator.
@@ -188,3 +190,26 @@ memory areas instead calls out to C code for device emulation.
Finally, the MMU helps tracking dirty pages and pages pointed to by
translation blocks.
+Profiling JITted code
+---------------------
+
+The Linux ``perf`` tool will treat all JITted code as a single block as
+unlike the main code it can't use debug information to link individual
+program counter samples with larger functions. To overcome this
+limitation you can use the ``-perfmap`` or the ``-jitdump`` option to generate
+map files. ``-perfmap`` is lightweight and produces only guest-host mappings.
+``-jitdump`` additionally saves JITted code and guest debug information (if
+available); its output needs to be integrated with the ``perf.data`` file
+before the final report can be viewed.
+
+.. code::
+
+ perf record $QEMU -perfmap $REMAINING_ARGS
+ perf report
+
+ perf record -k 1 $QEMU -jitdump $REMAINING_ARGS
+ DEBUGINFOD_URLS= perf inject -j -i perf.data -o perf.data.jitted
+ perf report -i perf.data.jitted
+
+Note that qemu-system generates mappings only for ``-kernel`` files in ELF
+format.
diff --git a/docs/devel/testing.rst b/docs/devel/testing.rst
index 64c9744795..fa28e3ecb2 100644
--- a/docs/devel/testing.rst
+++ b/docs/devel/testing.rst
@@ -1,11 +1,12 @@
-===============
+.. _testing:
+
Testing in QEMU
===============
This document describes the testing infrastructure in QEMU.
Testing with "make check"
-=========================
+-------------------------
The "make check" testing family includes most of the C based tests in QEMU. For
a quick help, run ``make check-help`` from the source tree.
@@ -24,7 +25,7 @@ expect the executables to exist and will fail with obscure messages if they
cannot find them.
Unit tests
-----------
+~~~~~~~~~~
Unit tests, which can be invoked with ``make check-unit``, are simple C tests
that typically link to individual QEMU object files and exercise them by
@@ -67,7 +68,7 @@ and copy the actual command line which executes the unit test, then run
it from the command line.
QTest
------
+~~~~~
QTest is a device emulation testing framework. It can be very useful to test
device models; it could also control certain aspects of QEMU (such as virtual
@@ -80,8 +81,38 @@ QTest cases can be executed with
make check-qtest
+Writing portable test cases
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Both unit tests and qtests can run on POSIX hosts as well as Windows hosts.
+Care must be taken when writing portable test cases that can be built and run
+successfully on various hosts. The following list shows some best practices:
+
+* Use portable APIs from glib whenever necessary, e.g.: g_setenv(),
+ g_mkdtemp(), g_mkdir().
+* Avoid using hardcoded /tmp for temporary file directory.
+ Use g_get_tmp_dir() instead.
+* Bear in mind that Windows has different special string representations for
+ stdin/stdout/stderr and null devices. For example if your test case uses
+ "/dev/fd/2" and "/dev/null" on Linux, remember to use "2" and "nul" on
+ Windows instead. Also IO redirection does not work on Windows, so avoid
+ using "2>nul" wherever possible.
+* If your test case uses the blkdebug feature, use relative paths to pass
+ the config and image file paths in the command line, as Windows absolute
+ paths contain the delimiter ":" which will confuse the blkdebug parser.
+* Use double quotes in your extra QEMU command line in your test cases
+ instead of single quotes, as Windows does not drop single quotes when
+ passing the command line to QEMU.
+* Windows opens a file in text mode by default, while a POSIX compliant
+ implementation treats text files and binary files the same. So if your
+ test case opens a file to write some data and later wants to compare the
+ written data with the original one, be sure to pass the letter 'b' as
+ part of the mode string to fopen(), or O_BINARY flag for the open() call.
+* If a certain test case can only run on POSIX or Linux hosts, use a proper
+ #ifdef in the code. If the whole test suite cannot run on Windows, disable
+ the build in the meson.build file.
+
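The binary-mode advice in the list above can be illustrated with standard C alone. This is a minimal sketch, not code from the QEMU test suite: ``roundtrip_binary`` is a hypothetical helper, and the file path is supplied by the caller (a real test would build it from ``g_get_tmp_dir()`` rather than hardcoding a directory).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write len bytes to path in binary mode, read them back and compare.
 * Returns 1 if the data survived unchanged, 0 otherwise.
 */
static int roundtrip_binary(const char *path, const unsigned char *data,
                            size_t len)
{
    unsigned char buf[64];
    size_t n;
    FILE *f;

    assert(len <= sizeof(buf));

    f = fopen(path, "wb");      /* 'b' avoids newline translation on Windows */
    if (!f) {
        return 0;
    }
    n = fwrite(data, 1, len, f);
    if (fclose(f) != 0 || n != len) {
        remove(path);
        return 0;
    }

    f = fopen(path, "rb");      /* read back in binary mode as well */
    if (!f) {
        remove(path);
        return 0;
    }
    n = fread(buf, 1, sizeof(buf), f);
    fclose(f);
    remove(path);

    return n == len && memcmp(buf, data, len) == 0;
}
```

Data containing ``\r``, ``\n`` or 0x1a bytes is the interesting case, since those are the bytes a text-mode stream on Windows may translate or treat specially.
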
QAPI schema tests
------------------
+~~~~~~~~~~~~~~~~~
The QAPI schema tests validate the QAPI parser used by QMP, by feeding
predefined input to the parser and comparing the result with the reference
@@ -108,33 +139,14 @@ parser (either fixing a bug or extending/modifying the syntax). To do this:
``qapi-schema += foo.json``
check-block
------------
+~~~~~~~~~~~
``make check-block`` runs a subset of the block layer iotests (the tests that
are in the "auto" group).
See the "QEMU iotests" section below for more information.
-GCC gcov support
-----------------
-
-``gcov`` is a GCC tool to analyze the testing coverage by
-instrumenting the tested code. To use it, configure QEMU with
-``--enable-gcov`` option and build. Then run ``make check`` as usual.
-
-If you want to gather coverage information on a single test the ``make
-clean-gcda`` target can be used to delete any existing coverage
-information before running a single test.
-
-You can generate a HTML coverage report by executing ``make
-coverage-html`` which will create
-``meson-logs/coveragereport/index.html``.
-
-Further analysis can be conducted by running the ``gcov`` command
-directly on the various .gcda output files. Please read the ``gcov``
-documentation for more information.
-
QEMU iotests
-============
+------------
QEMU iotests, under the directory ``tests/qemu-iotests``, is the testing
framework widely used to test block layer related features. It is higher level
@@ -171,7 +183,7 @@ More options are supported by the ``./check`` script, run ``./check -h`` for
help.
Writing a new test case
------------------------
+~~~~~~~~~~~~~~~~~~~~~~~
Consider writing a tests case when you are making any changes to the block
layer. An iotest case is usually the choice for that. There are already many
@@ -225,7 +237,8 @@ test failure. If using such devices are explicitly desired, consider adding
``locking=off`` option to disable image locking.
Debugging a test case
------------------------
+~~~~~~~~~~~~~~~~~~~~~
+
The following options to the ``check`` script can be useful when debugging
a failing test:
@@ -254,7 +267,7 @@ a failing test:
``$TEST_DIR/qemu-machine-<random_string>``.
Test case groups
-----------------
+~~~~~~~~~~~~~~~~
"Tests may belong to one or more test groups, which are defined in the form
of a comment in the test source file. By convention, test groups are listed
@@ -304,17 +317,17 @@ Note that the following group names have a special meaning:
.. _container-ref:
Container based tests
-=====================
+---------------------
Introduction
-------------
+~~~~~~~~~~~~
The container testing framework in QEMU utilizes public images to
build and test QEMU in predefined and widely accessible Linux
environments. This makes it possible to expand the test coverage
across distros, toolchain flavors and library versions. The support
was originally written for Docker although we also support Podman as
-an alternative container runtime. Although the many of the target
+an alternative container runtime. Although many of the target
names and scripts are prefixed with "docker" the system will
automatically run on whichever is configured.
@@ -322,7 +335,7 @@ The container images are also used to augment the generation of tests
for testing TCG. See :ref:`checktcg-ref` for more details.
Docker Prerequisites
---------------------
+~~~~~~~~~~~~~~~~~~~~
Install "docker" with the system package manager and start the Docker service
on your development machine, then make sure you have the privilege to run
@@ -353,7 +366,7 @@ exploit the whole host with Docker bind mounting or other privileged
operations. So only do it on development machines.
Podman Prerequisites
---------------------
+~~~~~~~~~~~~~~~~~~~~
Install "podman" with the system package manager.
@@ -365,7 +378,7 @@ Install "podman" with the system package manager.
The last command should print an empty table, to verify the system is ready.
Quickstart
-----------
+~~~~~~~~~~
From source tree, type ``make docker-help`` to see the help. Testing
can be started without configuring or building QEMU (``configure`` and
@@ -381,7 +394,7 @@ is downloaded and initialized automatically), in which the ``test-build`` job
is executed.
Registry
---------
+~~~~~~~~
The QEMU project has a container registry hosted by GitLab at
``registry.gitlab.com/qemu-project/qemu`` which will automatically be
@@ -392,25 +405,149 @@ locally by using the ``NOCACHE`` build option:
.. code::
- make docker-image-debian10 NOCACHE=1
+ make docker-image-debian-arm64-cross NOCACHE=1
Images
-------
+~~~~~~
Along with many other images, the ``centos8`` image is defined in a Dockerfile
in ``tests/docker/dockerfiles/``, called ``centos8.docker``. ``make docker-help``
command will list all the available images.
-To add a new image, simply create a new ``.docker`` file under the
-``tests/docker/dockerfiles/`` directory.
-
A ``.pre`` script can be added beside the ``.docker`` file, which will be
executed before building the image under the build context directory. This is
mainly used to do necessary host side setup. One such setup is ``binfmt_misc``,
for example, to make qemu-user powered cross build containers work.
+Most of the existing Dockerfiles were written by hand, simply by creating a
+new ``.docker`` file under the ``tests/docker/dockerfiles/`` directory.
+This has led to an inconsistent set of packages being present across the
+different containers.
+
+Thus going forward, QEMU is aiming to automatically generate the Dockerfiles
+using the ``lcitool`` program provided by the ``libvirt-ci`` project:
+
+ https://gitlab.com/libvirt/libvirt-ci
+
+``libvirt-ci`` contains an ``lcitool`` program as well as a list of
+mappings to distribution package names for a wide variety of third
+party projects. ``lcitool`` applies the mappings to a list of build
+pre-requisites in ``tests/lcitool/projects/qemu.yml``, determines the
+list of native packages to install on each distribution, and uses them
+to generate build environments (dockerfiles and Cirrus CI variable files)
+that are consistent across OS distributions.
+
+
+Adding new build pre-requisites
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+When preparing a patch series that adds a new build
+pre-requisite to QEMU, the pre-requisite should be added to
+``tests/lcitool/projects/qemu.yml`` in order to make the dependency
+available in the CI build environments.
+
+In the simple case where the pre-requisite is already known to ``libvirt-ci``
+the following steps are needed:
+
+ * Edit ``tests/lcitool/projects/qemu.yml`` and add the pre-requisite
+
+ * Run ``make lcitool-refresh`` to re-generate all relevant build environment
+ manifests
+
+It may be that ``libvirt-ci`` does not know about the new pre-requisite.
+If that is the case, some extra preparation steps will be required
+first to contribute the mapping to the ``libvirt-ci`` project:
+
+ * Fork the ``libvirt-ci`` project on gitlab
+
+ * Add an entry for the new build prerequisite to
+ ``lcitool/facts/mappings.yml``, listing its native package name on as
+ many OS distros as practical. Run ``python -m pytest --regenerate-output``
+ and check that the changes are correct.
+
+ * Commit the ``mappings.yml`` change together with the regenerated test
+ files, and submit a merge request to the ``libvirt-ci`` project.
+ Please note in the description that this is a new build pre-requisite
+ desired for use with QEMU.
+
+ * The CI pipeline will run to validate that the changes to ``mappings.yml``
+ are correct, by attempting to install the newly listed package on
+ all OS distributions supported by ``libvirt-ci``.
+
+ * Once the merge request is accepted, go back to QEMU and update
+ the ``tests/lcitool/libvirt-ci`` submodule to point to a commit that
+ contains the ``mappings.yml`` update. Then add the prerequisite and
+ run ``make lcitool-refresh``.
+
+ * Please also trigger gitlab container generation pipelines on your change
+ for as many OS distros as practical to make sure that there are no
+ obvious breakages when adding the new pre-requisite. Please see
+ `CI <https://www.qemu.org/docs/master/devel/ci.html>`__ documentation
+ page on how to trigger gitlab CI pipelines on your change.
+
+For enterprise distros that default to old, end-of-life versions of the
+Python runtime, QEMU uses a separate set of mappings that work with more
+recent versions. These can be found in ``tests/lcitool/mappings.yml``.
+Modifying this file should not be necessary unless the new pre-requisite
+is a Python library or tool.
+
+
+Adding new OS distros
+^^^^^^^^^^^^^^^^^^^^^
+
+In some cases ``libvirt-ci`` will not know about the OS distro that is
+desired to be tested. Before adding a new OS distro, discuss the proposed
+addition:
+
+ * Send a mail to qemu-devel, copying people listed in the
+ MAINTAINERS file for ``Build and test automation``.
+
+ There are limited CI compute resources available to QEMU, so the
+ cost/benefit tradeoff of adding new OS distros needs to be considered.
+
+ * File an issue at https://gitlab.com/libvirt/libvirt-ci/-/issues
+ pointing to the qemu-devel mail thread in the archives.
+
+ This alerts other people who might be interested in the work
+ to avoid duplication, as well as to get feedback from libvirt-ci
+ maintainers on any tips to ease the addition.
+
+Assuming there is agreement to add a new OS distro:
+
+ * Fork the ``libvirt-ci`` project on gitlab
+
+ * Add metadata under ``lcitool/facts/targets/`` for the new OS
+ distro. There might be code changes required if the OS distro
+ uses a package format not currently known. The ``libvirt-ci``
+ maintainers can advise on this when the issue is filed.
+
+ * Edit ``lcitool/facts/mappings.yml`` to add entries for
+ the new OS, listing the native package names for as many packages
+ as practical. Run ``python -m pytest --regenerate-output`` and
+ check that the changes are correct.
+
+ * Commit the changes to ``lcitool/facts`` and the regenerated test
+ files, and submit a merge request to the ``libvirt-ci`` project.
+ Please note in the description that this is a new OS distro
+ desired for use with QEMU.
+
+ * The CI pipeline will run to validate that the changes to ``mappings.yml``
+ are correct, by attempting to install the newly listed package on
+ all OS distributions supported by ``libvirt-ci``.
+
+ * Once the merge request is accepted, go back to QEMU and update
+ the ``libvirt-ci`` submodule to point to a commit that contains
+ the ``mappings.yml`` update.
+
+
Tests
------
+~~~~~
Different tests are added to cover various configurations to build and test
QEMU. Docker tests are the executables under ``tests/docker`` named
@@ -421,13 +558,13 @@ source and build it.
The full list of tests is printed in the ``make docker-help`` help.
Debugging a Docker test failure
--------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
When CI tasks, maintainers or yourself report a Docker test failure, follow the
below steps to debug it:
1. Locally reproduce the failure with the reported command line. E.g. run
- ``make docker-test-mingw@fedora J=8``.
+ ``make docker-test-mingw@fedora-win64-cross J=8``.
2. Add "V=1" to the command line, try again, to see the verbose output.
3. Further add "DEBUG=1" to the command line. This will pause in a shell prompt
in the container right before testing starts. You could either manually
@@ -438,7 +575,7 @@ below steps to debug it:
the prompt for debug.
Options
--------
+~~~~~~~
Various options can be used to affect how Docker tests are done. The full
list is in the ``make docker`` help text. The frequently used ones are:
@@ -452,7 +589,7 @@ list is in the ``make docker`` help text. The frequently used ones are:
failure" section.
Thread Sanitizer
-================
+----------------
Thread Sanitizer (TSan) is a tool which can detect data races. QEMU supports
building and testing with this tool.
@@ -462,14 +599,14 @@ For more information on TSan:
https://github.com/google/sanitizers/wiki/ThreadSanitizerCppManual
Thread Sanitizer in Docker
----------------------------
-TSan is currently supported in the ubuntu2004 docker.
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+TSan is currently supported in the ubuntu2204 docker.
The test-tsan test will build using TSan and then run make check.
.. code::
- make docker-test-tsan@ubuntu2004
+ make docker-test-tsan@ubuntu2204
TSan warnings under docker are placed in files located at build/tsan/.
@@ -477,7 +614,7 @@ We recommend using DEBUG=1 to allow launching the test from inside the docker,
and to allow review of the warnings generated by TSan.
Building and Testing with TSan
-------------------------------
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
It is possible to build and test with TSan, with a few additional steps.
These steps are normally done automatically in the docker.
@@ -516,7 +653,7 @@ This allows for running the test and then checking the warnings afterwards.
If you want TSan to stop and exit with error on warnings, use exitcode=66.
TSan Suppressions
------------------
+~~~~~~~~~~~~~~~~~
Keep in mind that for any data race warning, although there might be a data race
detected by TSan, there might be no actual bug here. TSan provides several
different mechanisms for suppressing warnings. In general it is recommended
@@ -531,18 +668,18 @@ suppressing it. More information on the file format can be found here:
https://github.com/google/sanitizers/wiki/ThreadSanitizerSuppressions
-tests/tsan/blacklist.tsan - Has TSan warnings we wish to disable
+tests/tsan/ignore.tsan - Has TSan warnings we wish to disable
at compile time for test or debug.
Add flags to configure to enable:
-"--extra-cflags=-fsanitize-blacklist=<src path>/tests/tsan/blacklist.tsan"
+"--extra-cflags=-fsanitize-blacklist=<src path>/tests/tsan/ignore.tsan"
More information on the file format can be found here under "Blacklist Format":
https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags
TSan Annotations
-----------------
+~~~~~~~~~~~~~~~~
include/qemu/tsan.h defines annotations. See this file for more descriptions
of the annotations themselves. Annotations can be used to suppress
TSan warnings or give TSan more information so that it can detect proper
@@ -558,15 +695,53 @@ The full set of annotations can be found here:
https://github.com/llvm/llvm-project/blob/master/compiler-rt/lib/tsan/rtl/tsan_interface_ann.cpp
+docker-binfmt-image-debian-% targets
+------------------------------------
+
+It is possible to combine Debian's bootstrap scripts with a configured
+``binfmt_misc`` to bootstrap a number of Debian's distros including
+experimental ports not yet supported by a released OS. This can
+simplify setting up a rootfs by using docker to contain the foreign
+rootfs rather than manually invoking chroot.
+
+Setting up ``binfmt_misc``
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+You can use the script ``qemu-binfmt-conf.sh`` to configure a QEMU
+user binary to automatically run binaries for the foreign
+architecture. While the scripts will try their best to work with
+dynamically linked QEMU binaries, a statically linked one will present fewer
+potential complications when copying into the docker image. Modern
+kernels support the ``F`` (fix binary) flag which will open the QEMU
+executable on setup and avoids the need to find and re-open in the
+chroot environment. This is triggered with the ``--persistent`` flag.
+
+Example invocation
+~~~~~~~~~~~~~~~~~~
+
+For example, to set up the HPPA ports build of Debian::
+
+ make docker-binfmt-image-debian-sid-hppa \
+ DEB_TYPE=sid DEB_ARCH=hppa \
+ DEB_URL=http://ftp.ports.debian.org/debian-ports/ \
+ DEB_KEYRING=/usr/share/keyrings/debian-ports-archive-keyring.gpg \
+ EXECUTABLE=$(pwd)/qemu-hppa V=1
+
+The ``DEB_`` variables are substitutions used by
+``debian-bootstrap.pre`` which is called to do the initial debootstrap
+of the rootfs before it is copied into the container. The second stage
+is run as part of the build. The final image will be tagged as
+``qemu/debian-sid-hppa``.
+
VM testing
-==========
+----------
This test suite contains scripts that bootstrap various guest images that have
necessary packages to build QEMU. The basic usage is documented in ``Makefile``
help which is displayed with ``make vm-help``.
Quickstart
-----------
+~~~~~~~~~~
Run ``make vm-help`` to list available make targets. Invoke a specific make
command to run build test in an image. For example, ``make vm-build-freebsd``
@@ -581,29 +756,29 @@ concerned about attackers taking control of the guest and potentially
exploiting a QEMU security bug to compromise the host.
QEMU binaries
--------------
+~~~~~~~~~~~~~
-By default, qemu-system-x86_64 is searched in $PATH to run the guest. If there
-isn't one, or if it is older than 2.10, the test won't work. In this case,
+By default, ``qemu-system-x86_64`` is searched in $PATH to run the guest. If
+there isn't one, or if it is older than 2.10, the test won't work. In this case,
provide the QEMU binary in env var: ``QEMU=/path/to/qemu-2.10+``.
-Likewise the path to qemu-img can be set in QEMU_IMG environment variable.
+Likewise the path to ``qemu-img`` can be set in the ``QEMU_IMG`` environment variable.
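
As a rough illustration of the version requirement above, the following Python sketch checks whether a reported version is at least 2.10. The helper names and the sample strings are assumptions made for this example; the VM test scripts themselves may detect an unsuitable binary differently, or not at all.

```python
import re

MIN_VERSION = (2, 10)  # minimum version stated in the text above


def parse_qemu_version(version_output):
    """Extract (major, minor) from text such as 'QEMU emulator version 8.2.0'."""
    match = re.search(r"version (\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("no version number found in: %r" % version_output)
    return int(match.group(1)), int(match.group(2))


def is_new_enough(version_output, minimum=MIN_VERSION):
    """Return True when the reported version meets the minimum."""
    # Tuples compare element by element, so (2, 10) > (2, 9).
    return parse_qemu_version(version_output) >= minimum
```

The input would typically be the first line printed by ``qemu-system-x86_64 --version``. Comparing integer tuples rather than strings matters here, because "2.9" sorts after "2.10" as text.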
Make jobs
----------
+~~~~~~~~~
The ``-j$X`` option in the make command line is not propagated into the VM,
specify ``J=$X`` to control the make jobs in the guest.
Debugging
----------
+~~~~~~~~~
Add ``DEBUG=1`` and/or ``V=1`` to the make command to allow interactive
debugging and verbose output. If this is not enough, see the next section.
``V=1`` will be propagated down into the make jobs in the guest.
Manual invocation
------------------
+~~~~~~~~~~~~~~~~~
Each guest script is an executable script with the same command line options.
For example to work with the netbsd guest, use ``$QEMU_SRC/tests/vm/netbsd``:
@@ -627,7 +802,7 @@ For example to work with the netbsd guest, use ``$QEMU_SRC/tests/vm/netbsd``:
$ ./netbsd --interactive --image /var/tmp/netbsd.img sh
Adding new guests
------------------
+~~~~~~~~~~~~~~~~~
Please look at existing guest scripts for how to add new guests.
@@ -660,7 +835,7 @@ the script's ``main()``.
recommended.
Image fuzzer testing
-====================
+--------------------
An image fuzzer was added to exercise format drivers. Currently only qcow2 is
supported. To start the fuzzer, run
@@ -669,20 +844,19 @@ supported. To start the fuzzer, run
tests/image-fuzzer/runner.py -c '[["qemu-img", "info", "$test_img"]]' /tmp/test qcow2
-Alternatively, some command different from "qemu-img info" can be tested, by
+Alternatively, some command different from ``qemu-img info`` can be tested, by
changing the ``-c`` option.
-Acceptance tests using the Avocado Framework
-============================================
+Integration tests using the Avocado Framework
+---------------------------------------------
-The ``tests/acceptance`` directory hosts functional tests, also known
-as acceptance level tests. They're usually higher level tests, and
-may interact with external resources and with various guest operating
-systems.
+The ``tests/avocado`` directory hosts integration tests. They're usually
+higher level tests, and may interact with external resources and with
+various guest operating systems.
These tests are written using the Avocado Testing Framework (which must
be installed separately) in conjunction with the ``avocado_qemu.Test``
-class, implemented at ``tests/acceptance/avocado_qemu``.
+class, implemented at ``tests/avocado/avocado_qemu``.
Tests based on ``avocado_qemu.Test`` can easily:
@@ -712,17 +886,17 @@ Tests based on ``avocado_qemu.Test`` can easily:
- http://avocado-framework.readthedocs.io/en/latest/api/utils/avocado.utils.html
Running tests
--------------
+~~~~~~~~~~~~~
-You can run the acceptance tests simply by executing:
+You can run the avocado tests simply by executing:
.. code::
- make check-acceptance
+ make check-avocado
-This involves the automatic creation of Python virtual environment
-within the build tree (at ``tests/venv``) which will have all the
-right dependencies, and will save tests results also within the
+This involves the automatic installation, from PyPI, of all the
+necessary avocado-framework dependencies into the QEMU venv within the
+build tree (at ``./pyvenv``). Test results are also saved within the
build tree (at ``tests/results``).
Note: the build environment must be using a Python 3 stack, and have
@@ -733,12 +907,12 @@ specific version, they may be on packages named ``python3-venv`` and
``python3-pip``.
It is also possible to run tests based on tags using the
-``make check-acceptance`` command and the ``AVOCADO_TAGS`` environment
+``make check-avocado`` command and the ``AVOCADO_TAGS`` environment
variable:
.. code::
- make check-acceptance AVOCADO_TAGS=quick
+ make check-avocado AVOCADO_TAGS=quick
Note that tags separated with commas have an AND behavior, while tags
separated by spaces have an OR behavior. For more information on Avocado
@@ -747,31 +921,31 @@ tags, see:
https://avocado-framework.readthedocs.io/en/latest/guides/user/chapters/tags.html
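
The AND/OR rule can be pictured with a small Python model. This is only a sketch of the behavior described above, not Avocado's implementation: the function name is hypothetical, and any other filter features Avocado may support are ignored.

```python
def matches_tag_filter(test_tags, tag_filter):
    """Return True if a test's tags satisfy a filter expression.

    Space-separated groups are alternatives (OR); within a group,
    comma-separated tags must all be present (AND).
    """
    test_tags = set(test_tags)
    for group in tag_filter.split():
        required = {tag for tag in group.split(",") if tag}
        # A group matches only when every tag in it is on the test.
        if required and required <= test_tags:
            return True
    return False
```

Under this model a test tagged only ``quick`` is selected by ``AVOCADO_TAGS='quick arch:x86_64'`` but not by ``AVOCADO_TAGS=quick,arch:x86_64``.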
To run a single test file, a couple of them, or a test within a file
-using the ``make check-acceptance`` command, set the ``AVOCADO_TESTS``
+using the ``make check-avocado`` command, set the ``AVOCADO_TESTS``
environment variable with the test files or test names. To run all
tests from a single file, use:
.. code::
- make check-acceptance AVOCADO_TESTS=$FILEPATH
+ make check-avocado AVOCADO_TESTS=$FILEPATH
The same is valid to run tests from multiple test files:
.. code::
- make check-acceptance AVOCADO_TESTS='$FILEPATH1 $FILEPATH2'
+ make check-avocado AVOCADO_TESTS='$FILEPATH1 $FILEPATH2'
To run a single test within a file, use:
.. code::
- make check-acceptance AVOCADO_TESTS=$FILEPATH:$TESTCLASS.$TESTNAME
+ make check-avocado AVOCADO_TESTS=$FILEPATH:$TESTCLASS.$TESTNAME
The same is valid to run single tests from multiple test files:
.. code::
- make check-acceptance AVOCADO_TESTS='$FILEPATH1:$TESTCLASS1.$TESTNAME1 $FILEPATH2:$TESTCLASS2.$TESTNAME2'
+ make check-avocado AVOCADO_TESTS='$FILEPATH1:$TESTCLASS1.$TESTNAME1 $FILEPATH2:$TESTCLASS2.$TESTNAME2'
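
When the list of tests is produced by a script, the ``AVOCADO_TESTS`` value can be assembled from its parts. The sketch below only mirrors the ``$FILEPATH:$TESTCLASS.$TESTNAME`` format shown above; the helper names and the file path ``tests/avocado/example.py`` are hypothetical.

```python
def make_test_reference(filepath, test_class=None, test_name=None):
    """Build one reference: FILEPATH or FILEPATH:TESTCLASS.TESTNAME."""
    if test_class is None and test_name is None:
        return filepath
    if test_class is None or test_name is None:
        raise ValueError("test_class and test_name must be given together")
    return "%s:%s.%s" % (filepath, test_class, test_name)


def make_avocado_tests_value(references):
    """Join references with spaces, as in the multi-file examples above."""
    return " ".join(references)


value = make_avocado_tests_value([
    make_test_reference("tests/avocado/example.py",  # hypothetical path
                        "Version", "test_qmp_human_info_version"),
])
```

The resulting string still has to be quoted on the ``make`` command line whenever it contains more than one reference, since the references are separated by spaces.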
The scripts installed inside the virtual environment may be used
without an "activation". For instance, the Avocado test runner
@@ -779,9 +953,9 @@ may be invoked by running:
.. code::
- tests/venv/bin/avocado run $OPTION1 $OPTION2 tests/acceptance/
+ pyvenv/bin/avocado run $OPTION1 $OPTION2 tests/avocado/
-Note that if ``make check-acceptance`` was not executed before, it is
+Note that if ``make check-avocado`` was not executed before, it is
possible to create the Python virtual environment with the dependencies
needed running:
@@ -794,23 +968,23 @@ a test file. To run tests from a single file within the build tree, use:
.. code::
- tests/venv/bin/avocado run tests/acceptance/$TESTFILE
+ pyvenv/bin/avocado run tests/avocado/$TESTFILE
To run a single test within a test file, use:
.. code::
- tests/venv/bin/avocado run tests/acceptance/$TESTFILE:$TESTCLASS.$TESTNAME
+ pyvenv/bin/avocado run tests/avocado/$TESTFILE:$TESTCLASS.$TESTNAME
Valid test names are visible in the output from any previous execution
-of Avocado or ``make check-acceptance``, and can also be queried using:
+of Avocado or ``make check-avocado``, and can also be queried using:
.. code::
- tests/venv/bin/avocado list tests/acceptance
+ pyvenv/bin/avocado list tests/avocado
Manual Installation
--------------------
+~~~~~~~~~~~~~~~~~~~
To manually install Avocado and its dependencies, run:
@@ -823,26 +997,26 @@ Alternatively, follow the instructions on this link:
https://avocado-framework.readthedocs.io/en/latest/guides/user/chapters/installing.html
Overview
---------
+~~~~~~~~
-The ``tests/acceptance/avocado_qemu`` directory provides the
+The ``tests/avocado/avocado_qemu`` directory provides the
``avocado_qemu`` Python module, containing the ``avocado_qemu.Test``
class. Here's a simple usage example:
.. code::
- from avocado_qemu import Test
+ from avocado_qemu import QemuSystemTest
- class Version(Test):
+ class Version(QemuSystemTest):
"""
:avocado: tags=quick
"""
def test_qmp_human_info_version(self):
self.vm.launch()
- res = self.vm.command('human-monitor-command',
- command_line='info version')
- self.assertRegexpMatches(res, r'^(\d+\.\d+\.\d)')
+ res = self.vm.cmd('human-monitor-command',
+ command_line='info version')
+ self.assertRegex(res, r'^(\d+\.\d+\.\d)')
To execute your test, run:
@@ -859,7 +1033,7 @@ in the current directory, tagged as "quick", run:
avocado run -t quick .
The ``avocado_qemu.Test`` base test class
------------------------------------------
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``avocado_qemu.Test`` class has a number of characteristics that
are worth being mentioned right away.
@@ -879,10 +1053,10 @@ and hypothetical example follows:
.. code::
- from avocado_qemu import Test
+ from avocado_qemu import QemuSystemTest
- class MultipleMachines(Test):
+ class MultipleMachines(QemuSystemTest):
def test_multiple_machines(self):
first_machine = self.get_vm()
second_machine = self.get_vm()
@@ -891,25 +1065,25 @@ and hypothetical example follows:
first_machine.launch()
second_machine.launch()
- first_res = first_machine.command(
+ first_res = first_machine.cmd(
'human-monitor-command',
command_line='info version')
- second_res = second_machine.command(
+ second_res = second_machine.cmd(
'human-monitor-command',
command_line='info version')
- third_res = self.get_vm(name='third_machine').command(
+ third_res = self.get_vm(name='third_machine').cmd(
'human-monitor-command',
command_line='info version')
- self.assertEquals(first_res, second_res, third_res)
+ self.assertEqual(first_res, second_res)
+ self.assertEqual(second_res, third_res)
At test "tear down", ``avocado_qemu.Test`` handles all the QEMUMachines
shutdown.
The ``avocado_qemu.LinuxTest`` base test class
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The ``avocado_qemu.LinuxTest`` is further specialization of the
``avocado_qemu.Test`` class, so it contains all the characteristics of
@@ -932,7 +1106,7 @@ like this:
self.ssh_command('some_command_to_be_run_in_the_guest')
Please refer to tests that use ``avocado_qemu.LinuxTest`` under
-``tests/acceptance`` for more examples.
+``tests/avocado`` for more examples.
QEMUMachine
~~~~~~~~~~~
@@ -952,7 +1126,7 @@ execution of a QEMU binary, giving its users:
a more succinct and intuitive way
QEMU binary selection
-~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^
The QEMU binary used for the ``self.vm`` QEMUMachine instance will
primarily depend on the value of the ``qemu_bin`` parameter. If it's
@@ -973,20 +1147,23 @@ The resulting ``qemu_bin`` value will be preserved in the
``avocado_qemu.Test`` as an attribute with the same name.
Attribute reference
--------------------
+~~~~~~~~~~~~~~~~~~~
+
+Test
+^^^^
Besides the attributes and methods that are part of the base
``avocado.Test`` class, the following attributes are available on any
``avocado_qemu.Test`` instance.
vm
-~~
+''
A QEMUMachine instance, initially configured according to the given
``qemu_bin`` parameter.
arch
-~~~~
+''''
The architecture can be used on different levels of the stack, e.g. by
the framework or by the test itself. At the framework level, it will
@@ -1003,7 +1180,7 @@ name. If one is not given explicitly, it will either be set to
``:avocado: tags=arch:VALUE`` tag, it will be set to ``VALUE``.
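
To make the ``tags=arch:VALUE`` form concrete, here is a minimal Python sketch that pulls such a value out of a docstring. It is not Avocado's parser; the function name is hypothetical and only the simple ``KEY:VALUE`` form described here is handled.

```python
import re


def tag_value(docstring, key):
    """Return VALUE from a ':avocado: tags=KEY:VALUE' line, or None."""
    for line in (docstring or "").splitlines():
        match = re.match(r"\s*:avocado:\s+tags=(.*)$", line)
        if match is None:
            continue
        # One line may carry several comma-separated tags.
        for tag in match.group(1).split(","):
            name, sep, value = tag.strip().partition(":")
            if sep and name == key:
                return value
    return None
```

The same ``KEY:VALUE`` shape is used by the ``cpu`` and ``machine`` tags described in the following sections.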
cpu
-~~~
+'''
The cpu model that will be set to all QEMUMachine instances created
by the test.
@@ -1014,7 +1191,7 @@ name. If one is not given explicitly, it will either be set to
``:avocado: tags=cpu:VALUE`` tag, it will be set to ``VALUE``.
machine
-~~~~~~~
+'''''''
The machine type that will be set to all QEMUMachine instances created
by the test.
@@ -1025,20 +1202,20 @@ name. If one is not given explicitly, it will either be set to
``:avocado: tags=machine:VALUE`` tag, it will be set to ``VALUE``.
qemu_bin
-~~~~~~~~
+''''''''
The preserved value of the ``qemu_bin`` parameter or the result of the
dynamic probe for a QEMU binary in the current working directory or
source tree.
LinuxTest
-~~~~~~~~~
+^^^^^^^^^
Besides the attributes present on the ``avocado_qemu.Test`` base
class, the ``avocado_qemu.LinuxTest`` adds the following attributes:
distro
-......
+''''''
The name of the Linux distribution used as the guest image for the
test. The name should match the **Provider** column on the list
@@ -1047,7 +1224,7 @@ of images supported by the avocado.utils.vmimage library:
https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images
distro_version
-..............
+''''''''''''''
The version of the Linux distribution as the guest image for the
test. The name should match the **Version** column on the list
@@ -1056,7 +1233,7 @@ of images supported by the avocado.utils.vmimage library:
https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images
distro_checksum
-...............
+'''''''''''''''
The sha256 hash of the guest image file used for the test.
@@ -1065,7 +1242,7 @@ same name), no validation on the integrity of the image will be
performed.
Parameter reference
--------------------
+~~~~~~~~~~~~~~~~~~~
To understand how Avocado parameters are accessed by tests, and how
they can be passed to tests, please refer to::
@@ -1079,8 +1256,11 @@ like the following:
PARAMS (key=qemu_bin, path=*, default=./qemu-system-x86_64) => './qemu-system-x86_64
+Test
+^^^^
+
arch
-~~~~
+''''
The architecture that will influence the selection of a QEMU binary
(when one is not explicitly given).
@@ -1093,31 +1273,30 @@ This parameter has a direct relation with the ``arch`` attribute. If
not given, it will default to None.
cpu
-~~~
+'''
The cpu model that will be set to all QEMUMachine instances created
by the test.
machine
-~~~~~~~
+'''''''
The machine type that will be set to all QEMUMachine instances created
by the test.
-
qemu_bin
-~~~~~~~~
+''''''''
The exact QEMU binary to be used on QEMUMachine.
LinuxTest
-~~~~~~~~~
+^^^^^^^^^
Besides the parameters present on the ``avocado_qemu.Test`` base
class, the ``avocado_qemu.LinuxTest`` adds the following parameters:
distro
-......
+''''''
The name of the Linux distribution used as the guest image for the
test. The name should match the **Provider** column on the list
@@ -1126,7 +1305,7 @@ of images supported by the avocado.utils.vmimage library:
https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images
distro_version
-..............
+''''''''''''''
The version of the Linux distribution as the guest image for the
test. The name should match the **Version** column on the list
@@ -1135,7 +1314,7 @@ of images supported by the avocado.utils.vmimage library:
https://avocado-framework.readthedocs.io/en/latest/guides/writer/libs/vmimage.html#supported-images
distro_checksum
-...............
+'''''''''''''''
The sha256 hash of the guest image file used for the test.
@@ -1143,7 +1322,8 @@ If this value is not set in the code or by this parameter no
validation on the integrity of the image will be performed.
Skipping tests
---------------
+~~~~~~~~~~~~~~
+
The Avocado framework provides Python decorators which make it easy to skip
tests under certain conditions. For example, on the lack of a binary
on the test system or when the running environment is a CI system. For further
@@ -1158,7 +1338,7 @@ environment variables became a kind of standard way to enable/disable tests.
Here is a list of the most used variables:
AVOCADO_ALLOW_LARGE_STORAGE
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^^^^
Tests which are going to fetch or produce assets considered *large* are not
going to run unless that ``AVOCADO_ALLOW_LARGE_STORAGE=1`` is exported on
the environment.
@@ -1166,8 +1346,19 @@ the environment.
The definition of *large* is a bit arbitrary here, but it usually means an
asset which occupies at least 1GB of size on disk when uncompressed.
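
A test can be guarded on this variable with a skip decorator. The sketch below uses the standard library's ``unittest.skipUnless`` so that it runs on its own; the class name, test name, and reason string are assumptions made for illustration, and real tests would use the Avocado decorators mentioned above.

```python
import os
import unittest


class LargeAssetExample(unittest.TestCase):
    """Hypothetical test that needs a large asset."""

    @unittest.skipUnless(os.getenv("AVOCADO_ALLOW_LARGE_STORAGE"),
                         "storage limited")
    def test_with_large_asset(self):
        # A real test would fetch or generate the large asset here.
        self.assertTrue(True)
```

Note that the condition is evaluated when the class is defined, so the variable has to be exported before the test module is loaded.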
+SPEED
+^^^^^
+Tests which have a long runtime will not be run unless ``SPEED=slow`` is
+exported on the environment.
+
+The definition of *long* is a bit arbitrary here, and it depends on the
+usefulness of the test too. A unique test is worth spending more time on,
+small variations on existing tests perhaps less so. As a rough guide,
+a test or set of similar tests counts as long if it takes more than
+100 seconds to complete.
+
AVOCADO_ALLOW_UNTRUSTED_CODE
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
There are tests which will boot a kernel image or firmware that can be
considered not safe to run on the developer's workstation, thus they are
skipped by default. The definition of *not safe* is also arbitrary but
@@ -1178,7 +1369,7 @@ You should export ``AVOCADO_ALLOW_UNTRUSTED_CODE=1`` on the environment in
order to allow tests which make use of those kind of assets.
AVOCADO_TIMEOUT_EXPECTED
-~~~~~~~~~~~~~~~~~~~~~~~~
+^^^^^^^^^^^^^^^^^^^^^^^^
The Avocado framework has a timeout mechanism which interrupts tests to avoid the
test suite of getting stuck. The timeout value can be set via test parameter or
property defined in the test class, for further details::
@@ -1191,21 +1382,36 @@ conditions. For example, tests that take longer to execute when QEMU is
compiled with debug flags. Therefore, the ``AVOCADO_TIMEOUT_EXPECTED`` variable
has been used to determine whether those tests should run or not.
-GITLAB_CI
-~~~~~~~~~
-A number of tests are flagged to not run on the GitLab CI. Usually because
-they proved to the flaky or there are constraints on the CI environment which
-would make them fail. If you encounter a similar situation then use that
-variable as shown on the code snippet below to skip the test:
+QEMU_TEST_FLAKY_TESTS
+^^^^^^^^^^^^^^^^^^^^^
+Some tests are not working reliably and thus are disabled by default.
+This includes tests that don't run reliably on GitLab's CI which
+usually expose real issues that are rarely seen on developer machines
+due to the constraints of the CI environment. If you encounter a
+similar situation then raise a bug and then mark the test as shown in
+the code snippet below:
.. code::
- @skipIf(os.getenv('GITLAB_CI'), 'Running on GitLab')
+ # See https://gitlab.com/qemu-project/qemu/-/issues/nnnn
+ @skipUnless(os.getenv('QEMU_TEST_FLAKY_TESTS'), 'Test is unstable on GitLab')
def test(self):
do_something()
+You can also add ``:avocado: tags=flaky`` to the test meta-data so
+only the flaky tests can be run as a group:
+
+.. code::
+
+ env QEMU_TEST_FLAKY_TESTS=1 ./pyvenv/bin/avocado \
+ run tests/avocado --filter-by-tags=flaky
+
+Tests should not live in this state forever and should either be fixed
+or eventually removed.
+
+
Uninstalling Avocado
---------------------
+~~~~~~~~~~~~~~~~~~~~
If you've followed the manual installation instructions above, you can
easily uninstall Avocado. Start by listing the packages you have
@@ -1217,13 +1423,13 @@ And remove any package you want with::
pip uninstall <package_name>
-If you've used ``make check-acceptance``, the Python virtual environment where
+If you've used ``make check-avocado``, the Python virtual environment where
Avocado is installed will be cleaned up as part of ``make check-clean``.
.. _checktcg-ref:
Testing with "make check-tcg"
-=============================
+-----------------------------
The check-tcg tests are intended for simple smoke tests of both
linux-user and softmmu TCG functionality. However to build test
@@ -1240,7 +1446,7 @@ for the architecture in question, for example::
$(configure) --cross-cc-aarch64=aarch64-cc
-There is also a ``--cross-cc-flags-ARCH`` flag in case additional
+There is also a ``--cross-cc-cflags-ARCH`` flag in case additional
compiler flags are needed to build for a given target.
If you have the ability to run containers as the user the build system
@@ -1256,7 +1462,7 @@ itself.
See :ref:`container-ref` for more details.
Running subset of tests
------------------------
+~~~~~~~~~~~~~~~~~~~~~~~
You can build the tests for one architecture::
@@ -1270,10 +1476,10 @@ Adding ``V=1`` to the invocation will show the details of how to
invoke QEMU for the test which is useful for debugging tests.
TCG test dependencies
----------------------
+~~~~~~~~~~~~~~~~~~~~~
The TCG tests are deliberately very light on dependencies and are
-either totally bare with minimal gcc lib support (for softmmu tests)
+either totally bare with minimal gcc lib support (for system-mode tests)
or just glibc (for linux-user tests). This is because getting a cross
compiler to work with additional libraries can be challenging.
@@ -1302,3 +1508,22 @@ exercise as many corner cases as possible. It is a useful test suite
to run to exercise QEMU's linux-user code::
https://linux-test-project.github.io/
+
+GCC gcov support
+----------------
+
+``gcov`` is a GCC tool that analyzes test coverage by
+instrumenting the tested code. To use it, configure QEMU with
+the ``--enable-gcov`` option and build. Then run the tests as usual.
+
+To gather coverage information for a single test, use the
+``make clean-gcda`` target to delete any existing coverage
+information before running that test.
+
+You can generate an HTML coverage report by executing ``make
+coverage-html`` which will create
+``meson-logs/coveragereport/index.html``.
+
+Further analysis can be conducted by running the ``gcov`` command
+directly on the various .gcda output files. Please read the ``gcov``
+documentation for more information.
diff --git a/docs/devel/tracing.rst b/docs/devel/tracing.rst
index ba83954899..043bed7fd0 100644
--- a/docs/devel/tracing.rst
+++ b/docs/devel/tracing.rst
@@ -1,3 +1,5 @@
+.. _tracing:
+
=======
Tracing
=======
@@ -46,7 +48,7 @@ file. During build, the "trace-events" file in each listed subdirectory will be
processed by the "tracetool" script to generate code for the trace events.
The individual "trace-events" files are merged into a "trace-events-all" file,
-which is also installed into "/usr/share/qemu" with the name "trace-events".
+which is also installed into "/usr/share/qemu".
This merged file is to be used by the "simpletrace.py" script to later analyse
traces in the simpletrace data format.
@@ -355,8 +357,7 @@ probes::
scripts/tracetool.py --backends=dtrace --format=stap \
--binary path/to/qemu-binary \
- --target-type system \
- --target-name x86_64 \
+ --probe-prefix qemu.system.x86_64 \
--group=all \
trace-events-all \
qemu.stp
@@ -411,88 +412,3 @@ disabled, this check will have no performance impact.
return ptr;
}
-"tcg"
------
-
-Guest code generated by TCG can be traced by defining an event with the "tcg"
-event property. Internally, this property generates two events:
-"<eventname>_trans" to trace the event at translation time, and
-"<eventname>_exec" to trace the event at execution time.
-
-Instead of using these two events, you should instead use the function
-"trace_<eventname>_tcg" during translation (TCG code generation). This function
-will automatically call "trace_<eventname>_trans", and will generate the
-necessary TCG code to call "trace_<eventname>_exec" during guest code execution.
-
-Events with the "tcg" property can be declared in the "trace-events" file with a
-mix of native and TCG types, and "trace_<eventname>_tcg" will gracefully forward
-them to the "<eventname>_trans" and "<eventname>_exec" events. Since TCG values
-are not known at translation time, these are ignored by the "<eventname>_trans"
-event. Because of this, the entry in the "trace-events" file needs two printing
-formats (separated by a comma)::
-
- tcg foo(uint8_t a1, TCGv_i32 a2) "a1=%d", "a1=%d a2=%d"
-
-For example::
-
- #include "trace-tcg.h"
-
- void some_disassembly_func (...)
- {
- uint8_t a1 = ...;
- TCGv_i32 a2 = ...;
- trace_foo_tcg(a1, a2);
- }
-
-This will immediately call::
-
- void trace_foo_trans(uint8_t a1);
-
-and will generate the TCG code to call::
-
- void trace_foo(uint8_t a1, uint32_t a2);
-
-"vcpu"
-------
-
-Identifies events that trace vCPU-specific information. It implicitly adds a
-"CPUState*" argument, and extends the tracing print format to show the vCPU
-information. If used together with the "tcg" property, it adds a second
-"TCGv_env" argument that must point to the per-target global TCG register that
-points to the vCPU when guest code is executed (usually the "cpu_env" variable).
-
-The "tcg" and "vcpu" properties are currently only honored in the root
-./trace-events file.
-
-The following example events::
-
- foo(uint32_t a) "a=%x"
- vcpu bar(uint32_t a) "a=%x"
- tcg vcpu baz(uint32_t a) "a=%x", "a=%x"
-
-Can be used as::
-
- #include "trace-tcg.h"
-
- CPUArchState *env;
- TCGv_ptr cpu_env;
-
- void some_disassembly_func(...)
- {
- /* trace emitted at this point */
- trace_foo(0xd1);
- /* trace emitted at this point */
- trace_bar(env_cpu(env), 0xd2);
- /* trace emitted at this point (env) and when guest code is executed (cpu_env) */
- trace_baz_tcg(env_cpu(env), cpu_env, 0xd3);
- }
-
-If the translating vCPU has address 0xc1 and code is later executed by vCPU
-0xc2, this would be an example output::
-
- // at guest code translation
- foo a=0xd1
- bar cpu=0xc1 a=0xd2
- baz_trans cpu=0xc1 a=0xd3
- // at guest code execution
- baz_exec cpu=0xc2 a=0xd3
diff --git a/docs/devel/trivial-patches.rst b/docs/devel/trivial-patches.rst
new file mode 100644
index 0000000000..9380c730f7
--- /dev/null
+++ b/docs/devel/trivial-patches.rst
@@ -0,0 +1,52 @@
+.. _trivial-patches:
+
+Trivial Patches
+===============
+
+Overview
+--------
+
+Trivial patches that change just a few lines of code sometimes languish
+on the mailing list even though they require only a small amount of
+review. This is often the case for patches that do not fall under an
+actively maintained subsystem and therefore fall through the cracks.
+
+The trivial patches team takes on the task of reviewing and building pull
+requests for patches that:
+
+- Do not fall under an actively maintained subsystem.
+- Are single patches or short series (max 2-4 patches).
+- Only touch a few lines of code.
+
+**You should hint that your patch is a candidate by CCing
+qemu-trivial@nongnu.org.**
+
+Repositories
+------------
+
+Since the trivial patch team rotates maintainership there is only one
+active repository at a time:
+
+- git://github.com/vivier/qemu.git trivial-patches - `browse <https://github.com/vivier/qemu/tree/trivial-patches>`__
+
+Workflow
+--------
+
+The trivial patches team rotates the duty of collecting trivial patches
+amongst its members. A team member's job is to:
+
+1. Identify trivial patches on the development mailing list.
+2. Review trivial patches, merge them into a git tree, and reply to state
+ that the patch is queued.
+3. Send pull requests to the development mailing list once a week.
+
+A single team member can be on duty as long as they like. The suggested
+time is 1 week before handing off to the next member.
+
+Team
+----
+
+If you would like to join the trivial patches team, contact Laurent
+Vivier. The current team includes:
+
+- `Laurent Vivier <mailto:laurent@vivier.eu>`__
diff --git a/docs/devel/ui.rst b/docs/devel/ui.rst
index 06c7d622ce..17fb667dec 100644
--- a/docs/devel/ui.rst
+++ b/docs/devel/ui.rst
@@ -1,8 +1,8 @@
=================
-Qemu UI subsystem
+QEMU UI subsystem
=================
-Qemu Clipboard
+QEMU Clipboard
--------------
.. kernel-doc:: include/ui/clipboard.h
diff --git a/docs/devel/vfio-iommufd.rst b/docs/devel/vfio-iommufd.rst
new file mode 100644
index 0000000000..3d1c11f175
--- /dev/null
+++ b/docs/devel/vfio-iommufd.rst
@@ -0,0 +1,166 @@
+===============================
+IOMMUFD BACKEND usage with VFIO
+===============================
+
+(The terms backend, container and BE are used interchangeably.)
+
+With the introduction of iommufd, the Linux kernel provides a generic
+interface for user space drivers to propagate their DMA mappings to the kernel
+for assigned devices. While the legacy kernel interface is group-centric,
+the new iommufd interface is device-centric, relying on device fd and iommufd.
+
+To support both interfaces in the QEMU VFIO device, a base container is
+introduced to abstract the common parts of the legacy VFIO and iommufd
+containers, so that the generic VFIO code can use either container.
+
+The base container implements generic functions such as memory_listener and
+address space management whereas the derived container implements callbacks
+specific to either legacy or iommufd. Each container has its own way to set up
+the secure context and DMA management interface. The diagram below shows how
+this looks with both containers.
+
+::
+
+ VFIO AddressSpace/Memory
+ +-------+ +----------+ +-----+ +-----+
+ | pci | | platform | | ap | | ccw |
+ +---+---+ +----+-----+ +--+--+ +--+--+ +----------------------+
+ | | | | | AddressSpace |
+ | | | | +------------+---------+
+ +---V-----------V-----------V--------V----+ /
+ | VFIOAddressSpace | <------------+
+ | | | MemoryListener
+ | VFIOContainerBase list |
+ +-------+----------------------------+----+
+ | |
+ | |
+ +-------V------+ +--------V----------+
+ | iommufd | | vfio legacy |
+ | container | | container |
+ +-------+------+ +--------+----------+
+ | |
+ | /dev/iommu | /dev/vfio/vfio
+ | /dev/vfio/devices/vfioX | /dev/vfio/$group_id
+ Userspace | |
+ ============+============================+===========================
+ Kernel | device fd |
+ +---------------+ | group/container fd
+ | (BIND_IOMMUFD | | (SET_CONTAINER/SET_IOMMU)
+ | ATTACH_IOAS) | | device fd
+ | | |
+ | +-------V------------V-----------------+
+ iommufd | | vfio |
+ (map/unmap | +---------+--------------------+-------+
+ ioas_copy) | | | map/unmap
+ | | |
+ +------V------+ +-----V------+ +------V--------+
+ | iommufd core | | device | | vfio iommu |
+ +-------------+ +------------+ +---------------+
+
+* Secure Context setup
+
+ - iommufd BE: uses device fd and iommufd to set up the secure context
+ (bind_iommufd, attach_ioas)
+ - vfio legacy BE: uses group fd and container fd to set up the secure context
+ (set_container, set_iommu)
+
+* Device access
+
+ - iommufd BE: device fd is opened through ``/dev/vfio/devices/vfioX``
+ - vfio legacy BE: device fd is retrieved from group fd ioctl
+
+* DMA Mapping flow
+
+ 1. VFIOAddressSpace receives MemoryRegion add/del via MemoryListener
+ 2. VFIO populates DMA map/unmap via the container BEs
+ * iommufd BE: uses iommufd
+ * vfio legacy BE: uses container fd
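
The split between the generic base container and the backend-specific callbacks described above can be pictured with a small, self-contained C sketch. This is only an illustration under stated assumptions: every name in it (``Container``, ``ContainerOps``, ``BackendStats``, ``container_region_add`` and so on) is hypothetical and is not one of QEMU's actual types, and the backend callbacks are stubs that merely count calls where a real backend would issue iommufd or legacy VFIO ioctls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-backend state; a real backend would hold fds here. */
typedef struct BackendStats {
    unsigned maps;
    unsigned unmaps;
} BackendStats;

/* Hypothetical callback table implemented by each derived container. */
typedef struct ContainerOps {
    const char *name;
    int (*dma_map)(void *opaque, uint64_t iova, uint64_t size);
    int (*dma_unmap)(void *opaque, uint64_t iova, uint64_t size);
} ContainerOps;

/* Hypothetical base container: generic bookkeeping plus a backend. */
typedef struct Container {
    const ContainerOps *ops;
    unsigned mapped_regions;
    void *opaque;
} Container;

/* Stub shared by both backends; it rejects empty ranges and counts calls. */
static int stub_dma_map(void *opaque, uint64_t iova, uint64_t size)
{
    BackendStats *s = opaque;

    (void)iova;
    if (size == 0) {
        return -1;
    }
    s->maps++;
    return 0;
}

static int stub_dma_unmap(void *opaque, uint64_t iova, uint64_t size)
{
    BackendStats *s = opaque;

    (void)iova;
    if (size == 0) {
        return -1;
    }
    s->unmaps++;
    return 0;
}

const ContainerOps iommufd_ops = { "iommufd", stub_dma_map, stub_dma_unmap };
const ContainerOps legacy_ops = { "vfio legacy", stub_dma_map, stub_dma_unmap };

/* Generic code: works with either backend through the callback table. */
int container_region_add(Container *c, uint64_t iova, uint64_t size)
{
    int ret;

    assert(c != NULL && c->ops != NULL);
    ret = c->ops->dma_map(c->opaque, iova, size);
    if (ret == 0) {
        c->mapped_regions++;
    }
    return ret;
}

int container_region_del(Container *c, uint64_t iova, uint64_t size)
{
    int ret;

    assert(c != NULL && c->ops != NULL);
    ret = c->ops->dma_unmap(c->opaque, iova, size);
    if (ret == 0 && c->mapped_regions > 0) {
        c->mapped_regions--;
    }
    return ret;
}
```

The point of the sketch is that ``container_region_add`` and ``container_region_del`` never need to know which backend is in use; swapping ``ops`` is enough to switch between the two.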
+
+Example configuration
+=====================
+
+Step 1: configure the host device
+---------------------------------
+
+This is exactly the same as for a VFIO device with the legacy VFIO container.
+
+Step 2: configure QEMU
+----------------------
+
+Interactions with ``/dev/iommu`` are abstracted by a new iommufd
+object (compiled in with the ``CONFIG_IOMMUFD`` option).
+
+Any QEMU device (e.g. VFIO device) wishing to use ``/dev/iommu`` must
+be linked with an iommufd object. Such a device gets a new optional
+property named ``iommufd`` which is used to pass an iommufd object.
+Take the ``vfio-pci`` device for example:
+
+.. code-block:: bash
+
+ -object iommufd,id=iommufd0
+ -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0
+
+Note that ``/dev/iommu`` and the VFIO cdev can be opened externally by a
+management layer. In that case the fd is passed to QEMU; the ``fd`` property
+accepts either a string naming the fd or a number, for example:
+
+.. code-block:: bash
+
+ -object iommufd,id=iommufd0,fd=22
+ -device vfio-pci,iommufd=iommufd0,fd=23
+
+If the ``fd`` property is not passed, the fd is opened by QEMU.
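
As a rough illustration of the management-layer side of fd passing, the sketch below formats the two option values shown in the example above from already-opened file descriptors. The helper name ``format_iommufd_args`` and its signature are assumptions made for this example only; they are not part of QEMU or of any management library, and opening the devices and handing the fds to the QEMU process is left out.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: fill OBJ with the value for "-object" and DEV with
 * the value for "-device", using fds the caller has already opened.
 * Returns 0 on success, -1 on a negative fd, an id containing a comma
 * (commas separate option fields) or a buffer that is too small.
 */
int format_iommufd_args(char *obj, size_t obj_len,
                        char *dev, size_t dev_len,
                        const char *id, int iommufd_fd, int device_fd)
{
    int n;

    assert(obj != NULL && dev != NULL && id != NULL);

    if (iommufd_fd < 0 || device_fd < 0 || strchr(id, ',') != NULL) {
        return -1;
    }

    n = snprintf(obj, obj_len, "iommufd,id=%s,fd=%d", id, iommufd_fd);
    if (n < 0 || (size_t)n >= obj_len) {
        return -1;
    }

    n = snprintf(dev, dev_len, "vfio-pci,iommufd=%s,fd=%d", id, device_fd);
    if (n < 0 || (size_t)n >= dev_len) {
        return -1;
    }

    return 0;
}
```

Called with the id ``iommufd0`` and fds 22 and 23, this yields the same two option values as the example above.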
+
+If no ``iommufd`` object is passed to the ``vfio-pci`` device, iommufd
+is not used and the user gets the behavior based on the legacy VFIO
+container:
+
+.. code-block:: bash
+
+ -device vfio-pci,host=0000:02:00.0
+
+Supported platform
+==================
+
+x86, ARM and s390x are currently supported.
+
+Caveats
+=======
+
+Dirty page sync
+---------------
+
+Dirty page sync is not yet supported with the iommufd backend, so live
+migration is disabled by default. It can be force-enabled as shown below,
+although this is inefficient.
+
+.. code-block:: bash
+
+ -object iommufd,id=iommufd0
+ -device vfio-pci,host=0000:02:00.0,iommufd=iommufd0,enable-migration=on
+
+P2P DMA
+-------
+
+PCI p2p DMA is unsupported as IOMMUFD doesn't yet support mapping hardware PCI
+BAR regions. The warning below is shown for an assigned PCI device; it is not a bug.
+
+.. code-block:: none
+
+ qemu-system-x86_64: warning: IOMMU_IOAS_MAP failed: Bad address, PCI BAR?
+ qemu-system-x86_64: vfio_container_dma_map(0x560cb6cb1620, 0xe000000021000, 0x3000, 0x7f32ed55c000) = -14 (Bad address)
+
+FD passing with mdev
+--------------------
+
+The ``vfio-pci`` device checks the ``sysfsdev`` property to decide whether the
+backend is an mdev. If FD passing is used, there is no way to know that and the
+mdev is treated like a real PCI device. The error below is reported if the user
+tries to enable RAM discarding for the mdev.
+
+.. code-block:: none
+
+ qemu-system-x86_64: -device vfio-pci,iommufd=iommufd0,x-balloon-allowed=on,fd=9: vfio VFIO_FD9: x-balloon-allowed only potentially compatible with mdev devices
+
+The ``vfio-ap`` and ``vfio-ccw`` devices don't have the same issue, as their
+backend devices are always mdev and RAM discarding is force-enabled.
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
deleted file mode 100644
index 9ff6163c88..0000000000
--- a/docs/devel/vfio-migration.rst
+++ /dev/null
@@ -1,150 +0,0 @@
-=====================
-VFIO device Migration
-=====================
-
-Migration of virtual machine involves saving the state for each device that
-the guest is running on source host and restoring this saved state on the
-destination host. This document details how saving and restoring of VFIO
-devices is done in QEMU.
-
-Migration of VFIO devices consists of two phases: the optional pre-copy phase,
-and the stop-and-copy phase. The pre-copy phase is iterative and allows to
-accommodate VFIO devices that have a large amount of data that needs to be
-transferred. The iterative pre-copy phase of migration allows for the guest to
-continue whilst the VFIO device state is transferred to the destination, this
-helps to reduce the total downtime of the VM. VFIO devices can choose to skip
-the pre-copy phase of migration by returning pending_bytes as zero during the
-pre-copy phase.
-
-A detailed description of the UAPI for VFIO device migration can be found in
-the comment for the ``vfio_device_migration_info`` structure in the header
-file linux-headers/linux/vfio.h.
-
-VFIO implements the device hooks for the iterative approach as follows:
-
-* A ``save_setup`` function that sets up the migration region and sets _SAVING
- flag in the VFIO device state.
-
-* A ``load_setup`` function that sets up the migration region on the
- destination and sets _RESUMING flag in the VFIO device state.
-
-* A ``save_live_pending`` function that reads pending_bytes from the vendor
- driver, which indicates the amount of data that the vendor driver has yet to
- save for the VFIO device.
-
-* A ``save_live_iterate`` function that reads the VFIO device's data from the
- vendor driver through the migration region during iterative phase.
-
-* A ``save_state`` function to save the device config space if it is present.
-
-* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
- VFIO device state and iteratively copies the remaining data for the VFIO
- device until the vendor driver indicates that no data remains (pending bytes
- is zero).
-
-* A ``load_state`` function that loads the config section and the data
- sections that are generated by the save functions above
-
-* ``cleanup`` functions for both save and load that perform any migration
- related cleanup, including unmapping the migration region
-
-
-The VFIO migration code uses a VM state change handler to change the VFIO
-device state when the VM state changes from running to not-running, and
-vice versa.
-
-Similarly, a migration state change handler is used to trigger a transition of
-the VFIO device state when certain changes of the migration state occur. For
-example, the VFIO device state is transitioned back to _RUNNING in case a
-migration failed or was canceled.
-
-System memory dirty pages tracking
-----------------------------------
-
-A ``log_global_start`` and ``log_global_stop`` memory listener callback informs
-the VFIO IOMMU module to start and stop dirty page tracking. A ``log_sync``
-memory listener callback marks those system memory pages as dirty which are
-used for DMA by the VFIO device. The dirty pages bitmap is queried per
-container. All pages pinned by the vendor driver through external APIs have to
-be marked as dirty during migration. When there are CPU writes, CPU dirty page
-tracking can identify dirtied pages, but any page pinned by the vendor driver
-can also be written by the device. There is currently no device or IOMMU
-support for dirty page tracking in hardware.
-
-By default, dirty pages are tracked when the device is in pre-copy as well as
-stop-and-copy phase. So, a page pinned by the vendor driver will be copied to
-the destination in both phases. Copying dirty pages in pre-copy phase helps
-QEMU to predict if it can achieve its downtime tolerances. If QEMU during
-pre-copy phase keeps finding dirty pages continuously, then it understands
-that even in stop-and-copy phase, it is likely to find dirty pages and can
-predict the downtime accordingly.
-
-QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
-which disables querying the dirty bitmap during pre-copy phase. If it is set to
-off, all dirty pages will be copied to the destination in stop-and-copy phase
-only.
-
-System memory dirty pages tracking when vIOMMU is enabled
----------------------------------------------------------
-
-With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
-phase of migration. In that case, the unmap ioctl returns any dirty pages in
-that range and QEMU reports corresponding guest physical pages dirty. During
-stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
-pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those
-mapped ranges.
-
-Flow of state changes during Live migration
-===========================================
-
-Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
-the VFIO device state, respectively.
-
-Live migration save path
-------------------------
-
-::
-
- QEMU normal running state
- (RUNNING, _NONE, _RUNNING)
- |
- migrate_init spawns migration_thread
- Migration thread then calls each device's .save_setup()
- (RUNNING, _SETUP, _RUNNING|_SAVING)
- |
- (RUNNING, _ACTIVE, _RUNNING|_SAVING)
- If device is active, get pending_bytes by .save_live_pending()
- If total pending_bytes >= threshold_size, call .save_live_iterate()
- Data of VFIO device for pre-copy phase is copied
- Iterate till total pending bytes converge and are less than threshold
- |
- On migration completion, vCPU stops and calls .save_live_complete_precopy for
- each active device. The VFIO device is then transitioned into _SAVING state
- (FINISH_MIGRATE, _DEVICE, _SAVING)
- |
- For the VFIO device, iterate in .save_live_complete_precopy until
- pending data is 0
- (FINISH_MIGRATE, _DEVICE, _STOPPED)
- |
- (FINISH_MIGRATE, _COMPLETED, _STOPPED)
- Migraton thread schedules cleanup bottom half and exits
-
-Live migration resume path
---------------------------
-
-::
-
- Incoming migration calls .load_setup for each device
- (RESTORE_VM, _ACTIVE, _STOPPED)
- |
- For each device, .load_state is called for that device section data
- (RESTORE_VM, _ACTIVE, _RESUMING)
- |
- At the end, .load_cleanup is called for each device and vCPUs are started
- (RUNNING, _NONE, _RUNNING)
-
-Postcopy
-========
-
-Postcopy migration is currently not supported for VFIO devices.
diff --git a/docs/devel/virtio-backends.rst b/docs/devel/virtio-backends.rst
new file mode 100644
index 0000000000..9ff092e7a0
--- /dev/null
+++ b/docs/devel/virtio-backends.rst
@@ -0,0 +1,214 @@
+..
+ Copyright (c) 2022, Linaro Limited
+ Written by Alex Bennée
+
+Writing VirtIO backends for QEMU
+================================
+
+This document attempts to outline the information a developer needs to
+know to write device emulations in QEMU. It is specifically focused on
+implementing VirtIO devices. For VirtIO the frontend is the driver
+running on the guest. The backend is everything that QEMU needs to
+do to handle the emulation of the VirtIO device. This can be done
+entirely in QEMU, divided between QEMU and the kernel (vhost) or
+handled by a separate process which is configured by QEMU
+(vhost-user).
+
+VirtIO Transports
+-----------------
+
+VirtIO supports a number of different transports. While the details of
+the configuration and operation of the device will generally be the
+same, QEMU represents them as different devices depending on the
+transport they use. For example ``-device virtio-foo`` represents the foo
+device using mmio and ``-device virtio-foo-pci`` is the same class of
+device using the PCI transport.
+
+Using the QEMU Object Model (QOM)
+---------------------------------
+
+Generally all devices in QEMU are subclasses of ``TYPE_DEVICE``;
+however, VirtIO devices should be based on ``TYPE_VIRTIO_DEVICE``, which
+itself is derived from the base class. For example:
+
+.. code:: c
+
+ static const TypeInfo virtio_blk_info = {
+ .name = TYPE_VIRTIO_BLK,
+ .parent = TYPE_VIRTIO_DEVICE,
+ .instance_size = sizeof(VirtIOBlock),
+ .instance_init = virtio_blk_instance_init,
+ .class_init = virtio_blk_class_init,
+ };
+
+The author may decide to have a more expansive class hierarchy to
+support multiple device types. For example the Virtio GPU device:
+
+.. code:: c
+
+ static const TypeInfo virtio_gpu_base_info = {
+ .name = TYPE_VIRTIO_GPU_BASE,
+ .parent = TYPE_VIRTIO_DEVICE,
+ .instance_size = sizeof(VirtIOGPUBase),
+ .class_size = sizeof(VirtIOGPUBaseClass),
+ .class_init = virtio_gpu_base_class_init,
+ .abstract = true
+ };
+
+ static const TypeInfo vhost_user_gpu_info = {
+ .name = TYPE_VHOST_USER_GPU,
+ .parent = TYPE_VIRTIO_GPU_BASE,
+ .instance_size = sizeof(VhostUserGPU),
+ .instance_init = vhost_user_gpu_instance_init,
+ .instance_finalize = vhost_user_gpu_instance_finalize,
+ .class_init = vhost_user_gpu_class_init,
+ };
+
+ static const TypeInfo virtio_gpu_info = {
+ .name = TYPE_VIRTIO_GPU,
+ .parent = TYPE_VIRTIO_GPU_BASE,
+ .instance_size = sizeof(VirtIOGPU),
+ .class_size = sizeof(VirtIOGPUClass),
+ .class_init = virtio_gpu_class_init,
+ };
+
+This defines a base class for the VirtIO GPU and then specialises two
+versions, one for the internal implementation and the other for the
+vhost-user version.
+
+VirtIOPCIProxy
+^^^^^^^^^^^^^^
+
+[AJB: the following is supposition and welcomes more informed
+opinions]
+
+Probably due to legacy from the pre-QOM days, PCI VirtIO devices don't
+follow the normal hierarchy. Instead, a standalone object is based
+on the VirtIOPCIProxy class and the specific VirtIO instance is
+manually instantiated:
+
+.. code:: c
+
+ /*
+ * virtio-blk-pci: This extends VirtioPCIProxy.
+ */
+ #define TYPE_VIRTIO_BLK_PCI "virtio-blk-pci-base"
+ DECLARE_INSTANCE_CHECKER(VirtIOBlkPCI, VIRTIO_BLK_PCI,
+ TYPE_VIRTIO_BLK_PCI)
+
+ struct VirtIOBlkPCI {
+ VirtIOPCIProxy parent_obj;
+ VirtIOBlock vdev;
+ };
+
+ static Property virtio_blk_pci_properties[] = {
+ DEFINE_PROP_UINT32("class", VirtIOPCIProxy, class_code, 0),
+ DEFINE_PROP_BIT("ioeventfd", VirtIOPCIProxy, flags,
+ VIRTIO_PCI_FLAG_USE_IOEVENTFD_BIT, true),
+ DEFINE_PROP_UINT32("vectors", VirtIOPCIProxy, nvectors,
+ DEV_NVECTORS_UNSPECIFIED),
+ DEFINE_PROP_END_OF_LIST(),
+ };
+
+ static void virtio_blk_pci_realize(VirtIOPCIProxy *vpci_dev, Error **errp)
+ {
+ VirtIOBlkPCI *dev = VIRTIO_BLK_PCI(vpci_dev);
+ DeviceState *vdev = DEVICE(&dev->vdev);
+
+ ...
+
+ qdev_realize(vdev, BUS(&vpci_dev->bus), errp);
+ }
+
+ static void virtio_blk_pci_class_init(ObjectClass *klass, void *data)
+ {
+ DeviceClass *dc = DEVICE_CLASS(klass);
+ VirtioPCIClass *k = VIRTIO_PCI_CLASS(klass);
+ PCIDeviceClass *pcidev_k = PCI_DEVICE_CLASS(klass);
+
+ set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+ device_class_set_props(dc, virtio_blk_pci_properties);
+ k->realize = virtio_blk_pci_realize;
+ pcidev_k->vendor_id = PCI_VENDOR_ID_REDHAT_QUMRANET;
+ pcidev_k->device_id = PCI_DEVICE_ID_VIRTIO_BLOCK;
+ pcidev_k->revision = VIRTIO_PCI_ABI_VERSION;
+ pcidev_k->class_id = PCI_CLASS_STORAGE_SCSI;
+ }
+
+ static void virtio_blk_pci_instance_init(Object *obj)
+ {
+ VirtIOBlkPCI *dev = VIRTIO_BLK_PCI(obj);
+
+ virtio_instance_init_common(obj, &dev->vdev, sizeof(dev->vdev),
+ TYPE_VIRTIO_BLK);
+ object_property_add_alias(obj, "bootindex", OBJECT(&dev->vdev),
+ "bootindex");
+ }
+
+ static const VirtioPCIDeviceTypeInfo virtio_blk_pci_info = {
+ .base_name = TYPE_VIRTIO_BLK_PCI,
+ .generic_name = "virtio-blk-pci",
+ .transitional_name = "virtio-blk-pci-transitional",
+ .non_transitional_name = "virtio-blk-pci-non-transitional",
+ .instance_size = sizeof(VirtIOBlkPCI),
+ .instance_init = virtio_blk_pci_instance_init,
+ .class_init = virtio_blk_pci_class_init,
+ };
+
+Here you can see the instance_init has to manually instantiate the
+underlying ``TYPE_VIRTIO_BLK`` object and link an alias for one of
+its properties to the PCI device.
+
+
+Back End Implementations
+------------------------
+
+There are a number of places where the implementation of the backend
+can be done:
+
+* in QEMU itself
+* in the host kernel (a.k.a. vhost)
+* in a separate process (a.k.a. vhost-user)
+
+vhost_ops vs TYPE_VHOST_USER_BACKEND
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are two ways to implement vhost code. Most of the code
+which has to work with either vhost or vhost-user uses
+``vhost_dev_init()`` to instantiate the appropriate backend. This
+means including a ``struct vhost_dev`` in the main object structure.
+
+For vhost-user devices you also need to add code to track the
+initialisation of the ``chardev`` device used for the control socket
+between QEMU and the external vhost-user process.
+
+If you only need to implement a vhost-user backend, the other option is
+to use a QOM-ified version of vhost-user.
+
+.. code:: c
+
+ static void
+ vhost_user_gpu_instance_init(Object *obj)
+ {
+ VhostUserGPU *g = VHOST_USER_GPU(obj);
+
+ g->vhost = VHOST_USER_BACKEND(object_new(TYPE_VHOST_USER_BACKEND));
+ object_property_add_alias(obj, "chardev",
+ OBJECT(g->vhost), "chardev");
+ }
+
+ static const TypeInfo vhost_user_gpu_info = {
+ .name = TYPE_VHOST_USER_GPU,
+ .parent = TYPE_VIRTIO_GPU_BASE,
+ .instance_size = sizeof(VhostUserGPU),
+ .instance_init = vhost_user_gpu_instance_init,
+ .instance_finalize = vhost_user_gpu_instance_finalize,
+ .class_init = vhost_user_gpu_class_init,
+ };
+
+Using it this way entails adding a ``struct VhostUserBackend`` to your
+core object structure and manually instantiating the backend. This
+sub-structure tracks both the ``vhost_dev`` and ``CharDev`` types
+needed for the connection. Instead of calling ``vhost_dev_init`` you
+would call ``vhost_user_backend_dev_init`` which does what is needed
+on your behalf.
diff --git a/docs/devel/virtio-migration.txt b/docs/devel/virtio-migration.txt
deleted file mode 100644
index 98a6b0ffb5..0000000000
--- a/docs/devel/virtio-migration.txt
+++ /dev/null
@@ -1,108 +0,0 @@
-Virtio devices and migration
-============================
-
-Copyright 2015 IBM Corp.
-
-This work is licensed under the terms of the GNU GPL, version 2 or later. See
-the COPYING file in the top-level directory.
-
-Saving and restoring the state of virtio devices is a bit of a twisty maze,
-for several reasons:
-- state is distributed between several parts:
- - virtio core, for common fields like features, number of queues, ...
- - virtio transport (pci, ccw, ...), for the different proxy devices and
- transport specific state (msix vectors, indicators, ...)
- - virtio device (net, blk, ...), for the different device types and their
- state (mac address, request queue, ...)
-- most fields are saved via the stream interface; subsequently, subsections
- have been added to make cross-version migration possible
-
-This file attempts to document the current procedure and point out some
-caveats.
-
-
-Save state procedure
-====================
-
-virtio core virtio transport virtio device
------------ ---------------- -------------
-
- save() function registered
- via VMState wrapper on
- device class
-virtio_save() <----------
- ------> save_config()
- - save proxy device
- - save transport-specific
- device fields
-- save common device
- fields
-- save common virtqueue
- fields
- ------> save_queue()
- - save transport-specific
- virtqueue fields
- ------> save_device()
- - save device-specific
- fields
-- save subsections
- - device endianness,
- if changed from
- default endianness
- - 64 bit features, if
- any high feature bit
- is set
- - virtio-1 virtqueue
- fields, if VERSION_1
- is set
-
-
-Load state procedure
-====================
-
-virtio core virtio transport virtio device
------------ ---------------- -------------
-
- load() function registered
- via VMState wrapper on
- device class
-virtio_load() <----------
- ------> load_config()
- - load proxy device
- - load transport-specific
- device fields
-- load common device
- fields
-- load common virtqueue
- fields
- ------> load_queue()
- - load transport-specific
- virtqueue fields
-- notify guest
- ------> load_device()
- - load device-specific
- fields
-- load subsections
- - device endianness
- - 64 bit features
- - virtio-1 virtqueue
- fields
-- sanitize endianness
-- sanitize features
-- virtqueue index sanity
- check
- - feature-dependent setup
-
-
-Implications of this setup
-==========================
-
-Devices need to be careful in their state processing during load: The
-load_device() procedure is invoked by the core before subsections have
-been loaded. Any code that depends on information transmitted in subsections
-therefore has to be invoked in the device's load() function _after_
-virtio_load() returned (like e.g. code depending on features).
-
-Any extension of the state being migrated should be done in subsections
-added to the core for compatibility reasons. If transport or device specific
-state is added, core needs to invoke a callback from the new subsection.
diff --git a/docs/devel/writing-monitor-commands.rst b/docs/devel/writing-monitor-commands.rst
new file mode 100644
index 0000000000..930da5cd06
--- /dev/null
+++ b/docs/devel/writing-monitor-commands.rst
@@ -0,0 +1,648 @@
+How to write monitor commands
+=============================
+
+This document is a step-by-step guide on how to write new QMP commands using
+the QAPI framework and HMP commands.
+
+This document doesn't discuss QMP protocol level details, nor does it dive
+into the QAPI framework implementation.
+
+For an in-depth introduction to the QAPI framework, please refer to
+:doc:`qapi-code-gen`. For the QMP protocol, see the
+:doc:`/interop/qmp-spec`.
+
+New commands may be implemented in QMP only. New HMP commands should be
+implemented on top of QMP. The typical HMP command wraps around an
+equivalent QMP command, but HMP convenience commands built from QMP
+building blocks are also fine. The long term goal is to make all
+existing HMP commands conform to this, to fully isolate HMP from the
+internals of QEMU. Refer to the `Writing a debugging aid returning
+unstructured text`_ section for further guidance on commands that
+would have traditionally been HMP only.
+
+Overview
+--------
+
+Generally speaking, the following steps should be taken in order to write a
+new QMP command.
+
+1. Define the command and any types it needs in the appropriate QAPI
+ schema module.
+
+2. Write the QMP command itself, which is a regular C function. Preferably,
+ the command should be exported by some QEMU subsystem. But it can also be
+ added to the monitor/qmp-cmds.c file.
+
+3. At this point the command can be tested under the QMP protocol.
+
+4. Write the HMP command equivalent. This is not required and should only be
+ done if it makes sense to have the functionality in HMP. The HMP command
+ is implemented in terms of the QMP command.
+
+The following sections will demonstrate each of the steps above. We will start
+very simple and get more complex as we progress.
+
+
+Testing
+-------
+
+For all the examples in the next sections, the test setup is the same and is
+shown here.
+
+First, QEMU should be started like this::
+
+ # qemu-system-TARGET [...] \
+ -chardev socket,id=qmp,port=4444,host=localhost,server=on \
+ -mon chardev=qmp,mode=control,pretty=on
+
+Then, in a different terminal::
+
+ $ telnet localhost 4444
+ Trying 127.0.0.1...
+ Connected to localhost.
+ Escape character is '^]'.
+ {
+ "QMP": {
+ "version": {
+ "qemu": {
+ "micro": 50,
+ "minor": 2,
+ "major": 8
+ },
+ "package": ...
+ },
+ "capabilities": [
+ "oob"
+ ]
+ }
+ }
+
+The above output is the QMP server saying you're connected. The server is
+actually in capabilities negotiation mode. To enter command mode, type::
+
+ { "execute": "qmp_capabilities" }
+
+Then the server should respond::
+
+ {
+ "return": {
+ }
+ }
+
+Which is QMP's way of saying "the latest command executed OK and didn't return
+any data". Now you're ready to enter the QMP example commands as explained in
+the following sections.
+
+
+Writing a simple command: hello-world
+-------------------------------------
+
+This is the simplest QMP command that can be written. Usually, this kind of
+command performs some meaningful action in QEMU, but here it will just print
+"Hello, world" to the standard output.
+
+Our command will be called "hello-world". It takes no arguments, nor does it
+return any data.
+
+The first step is defining the command in the appropriate QAPI schema
+module. We pick module qapi/misc.json, and add the following definition at
+the bottom::
+
+ ##
+ # @hello-world:
+ #
+ # Since: 9.0
+ ##
+ { 'command': 'hello-world' }
+
+The "command" keyword defines a new QMP command. It instructs QAPI to
+generate any prototypes and the necessary code to marshal and unmarshal
+protocol data.
+
+The next step is to write the "hello-world" implementation. As explained
+earlier, it's preferable for commands to live in QEMU subsystems. But
+"hello-world" doesn't pertain to any, so we put its implementation in
+monitor/qmp-cmds.c::
+
+ void qmp_hello_world(Error **errp)
+ {
+ printf("Hello, world!\n");
+ }
+
+There are a few things to be noticed:
+
+1. QMP command implementation functions must be prefixed with "qmp\_"
+2. qmp_hello_world() returns void, in accordance with the fact that the
+ command doesn't return any data
+3. It takes an "Error \*\*" argument. This is required. Later we will see how to
+ return errors and take additional arguments. The Error argument should not
+ be touched if the command doesn't return errors
+4. We won't add the function's prototype. That's automatically done by QAPI
+5. Printing to the terminal is discouraged for QMP commands; we do it here
+ because it's the easiest way to demonstrate a QMP command
+
+You're done. Now build QEMU, run it as suggested in the "Testing" section,
+and then type the following QMP command::
+
+ { "execute": "hello-world" }
+
+Then check the terminal running QEMU and look for the "Hello, world" string. If
+you don't see it then something went wrong.
+
+
+Arguments
+~~~~~~~~~
+
+Let's add arguments to our "hello-world" command.
+
+The first change we have to make is to modify the command specification in the
+schema file to the following::
+
+ ##
+ # @hello-world:
+ #
+ # @message: message to be printed (default: "Hello, world")
+ #
+ # @times: how many times to print the message (default: 1)
+ #
+ # Since: 9.0
+ ##
+ { 'command': 'hello-world',
+ 'data': { '*message': 'str', '*times': 'int' } }
+
+Notice the new 'data' member in the schema. It specifies an argument
+'message' of QAPI type 'str', and an argument 'times' of QAPI type
+'int'. Also notice the asterisk: it marks an argument as optional.
+
+Now, let's update our C implementation in monitor/qmp-cmds.c::
+
+ void qmp_hello_world(const char *message, bool has_times, int64_t times,
+ Error **errp)
+ {
+ if (!message) {
+ message = "Hello, world";
+ }
+ if (!has_times) {
+ times = 1;
+ }
+
+ for (int i = 0; i < times; i++) {
+ printf("%s\n", message);
+ }
+ }
+
+There are two important details to be noticed:
+
+1. Optional arguments other than pointers are accompanied by a 'has\_'
+ boolean, which is set to true if the optional argument is present, or
+ false otherwise. Optional pointer arguments are simply NULL when absent
+2. The C implementation signature must follow the schema's argument ordering,
+ which is defined by the "data" member
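Annotation: the 'has\_' convention described above can be tried outside QEMU. The following is a minimal, self-contained sketch, not code generated by QAPI. The name `sketch_hello_world` is hypothetical, and the `int64_t` return value (the number of lines printed) is an assumption added only so that the behaviour can be checked; the real handler returns void and also takes an `Error **errp` argument.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for a QMP handler with two optional arguments.
 * 'message' is a pointer, so absence is signalled by NULL.
 * 'times' is not a pointer, so it is paired with a 'has_times' flag.
 * Returns the number of lines printed (an addition for this sketch only).
 */
int64_t sketch_hello_world(const char *message, bool has_times, int64_t times)
{
    int64_t printed = 0;

    if (!message) {
        message = "Hello, world";   /* default for the absent pointer argument */
    }
    if (!has_times) {
        times = 1;                  /* 'times' is only meaningful if has_times */
    }

    for (int64_t i = 0; i < times; i++) {
        printf("%s\n", message);
        printed++;
    }
    return printed;
}
```

Calling `sketch_hello_world(NULL, false, 0)` prints the default message once: whatever value is passed for `times` is ignored whenever `has_times` is false.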
+
+Time to test our new version of the "hello-world" command. Build QEMU, run it as
+described in the "Testing" section and then send two commands::
+
+ { "execute": "hello-world" }
+ {
+ "return": {
+ }
+ }
+
+ { "execute": "hello-world", "arguments": { "message": "We love QEMU" } }
+ {
+ "return": {
+ }
+ }
+
+You should see "Hello, world" and "We love QEMU" in the terminal running QEMU.
+If you don't see these strings, then something went wrong.
+
+
+Errors
+~~~~~~
+
+QMP commands should use the error interface exported by the error.h header
+file. Basically, most errors are set by calling the error_setg() function.
+
+Let's say we don't accept the string "message" to contain the word "love". If
+it does contain it, we want the "hello-world" command to return an error::
+
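+ void qmp_hello_world(const char *message, bool has_times, int64_t times,
+                      Error **errp)
+ {
+     if (!message) {
+         message = "Hello, world";
+     } else if (strstr(message, "love")) {
+         /* Reject the message and report the failure through errp */
+         error_setg(errp, "the word 'love' is not allowed");
+         return;
+     }
+     if (!has_times) {
+         times = 1;
+     }
+
+     for (int i = 0; i < times; i++) {
+         printf("%s\n", message);
+     }
+ }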
+
+The first argument to the error_setg() function is the Error pointer
+to pointer, which is passed to all QMP functions. The next argument is a
+human-readable description of the error, given as a printf-like format string.
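Annotation: the "pointer to pointer" out-parameter pattern can be illustrated without QEMU's headers. The sketch below is a stand-in under stated assumptions: `SketchError`, `sketch_error_setg()` and `sketch_hello_world()` are hypothetical names, and the fixed-size description buffer is a simplification. It is not QEMU's error.h implementation; it only mimics the calling convention.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for QEMU's Error object; the real type is opaque and richer. */
typedef struct SketchError {
    char desc[128];
} SketchError;

/* Stand-in for error_setg(): stores a formatted description in *errp. */
void sketch_error_setg(SketchError **errp, const char *fmt, ...)
{
    va_list ap;
    SketchError *err;

    if (!errp) {
        return;                 /* the caller is not interested in the error */
    }
    err = calloc(1, sizeof(*err));
    if (!err) {
        abort();
    }
    va_start(ap, fmt);
    vsnprintf(err->desc, sizeof(err->desc), fmt, ap);
    va_end(ap);
    *errp = err;                /* the caller now owns the error object */
}

/* Returns nothing; failure is reported only through errp. */
void sketch_hello_world(const char *message, SketchError **errp)
{
    if (message && strstr(message, "love")) {
        sketch_error_setg(errp, "the word '%s' is not allowed", "love");
        return;
    }
    printf("%s\n", message ? message : "Hello, world");
}
```

A caller declares `SketchError *err = NULL;`, passes `&err`, and checks afterwards whether `err` is still NULL. This is why the error variable has to start out as NULL: the callee leaves it untouched on success.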
+
+Let's test the example above. Build QEMU, run it as defined in the "Testing"
+section, and then issue the following command::
+
+ { "execute": "hello-world", "arguments": { "message": "all you need is love" } }
+
+The QMP server's response should be::
+
+ {
+ "error": {
+ "class": "GenericError",
+ "desc": "the word 'love' is not allowed"
+ }
+ }
+
+Note that error_setg() produces a "GenericError" class. In general,
+all QMP errors should have that error class. There are two exceptions
+to this rule:
+
+ 1. To support a management application's need to recognize a specific
+ error for special handling
+
+ 2. Backward compatibility
+
+If the failure you want to report falls into one of the two cases above,
+use error_set() with a second argument of an ErrorClass value.
+
+
+Implementing the HMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the QMP command is in place, we can also make it available in the human
+monitor (HMP).
+
+With the introduction of QAPI, HMP commands make QMP calls. Most of the
+time HMP commands are simple wrappers.
+
+Here's the implementation of the "hello-world" HMP command::
+
+ void hmp_hello_world(Monitor *mon, const QDict *qdict)
+ {
+ const char *message = qdict_get_try_str(qdict, "message");
+ Error *err = NULL;
+
+ qmp_hello_world(message, false, 0, &err);
+ if (hmp_handle_error(mon, err)) {
+ return;
+ }
+ }
+
+Add it to monitor/hmp-cmds.c. Also, add its prototype to
+include/monitor/hmp.h.
+
+There are four important points to be noticed:
+
+1. The "mon" and "qdict" arguments are mandatory for all HMP functions. The
+ former is the monitor object. The latter is how the monitor passes
+ arguments entered by the user to the command implementation
+2. We chose not to support the "times" argument in HMP
+3. hmp_hello_world() performs error checking. In this example we just call
+ hmp_handle_error() which prints a message to the user, but we could do
+ more, like taking different actions depending on the error
+ qmp_hello_world() returns
+4. The "err" variable must be initialized to NULL before performing the
+ QMP call
+
+There's one last step to actually make the command available to monitor users:
+we should add it to the hmp-commands.hx file::
+
+ {
+ .name = "hello-world",
+ .args_type = "message:s?",
+ .params = "[message]",
+ .help = "Print message to the standard output",
+ .cmd = hmp_hello_world,
+ },
+
+ SRST
+ ``hello-world`` *message*
+ Print message to the standard output
+ ERST
+
+To test this you have to open a user monitor and issue the "hello-world"
+command. It might be instructive to check the command's documentation with
+HMP's "help" command.
+
+Please check the "-monitor" command-line option to know how to open a user
+monitor.
+
+
+Writing more complex commands
+-----------------------------
+
+A QMP command is capable of returning any data QAPI supports, such as integers,
+strings, booleans, enumerations and user defined types.
+
+In this section we will focus on user defined types. Please check the QAPI
+documentation for information about the other types.
+
+
+Modelling data in QAPI
+~~~~~~~~~~~~~~~~~~~~~~
+
+For a QMP command to be considered stable and supported long term,
+its returned data must be explicitly modelled
+using fine-grained QAPI types. As a general guide, a caller of the QMP
+command should never need to parse individual returned data fields. If
+a field appears to need parsing, then it should be split into separate
+fields corresponding to each distinct data item. This should be the
+common case for any new QMP command that is intended to be used by
+machines, as opposed to exclusively human operators.
+
+Some QMP commands, however, are only intended as ad hoc debugging aids
+for human operators. While they may return large amounts of formatted
+data, it is not expected that machines will need to parse the result.
+The overhead of defining a fine grained QAPI type for the data may not
+be justified by the potential benefit. In such cases, it is permitted
+to have a command return a simple string that contains formatted data,
+however, it is mandatory for the command to be marked unstable.
+This indicates that the command is not guaranteed to be long term
+stable / liable to change in future and is not following QAPI design
+best practices. An example where this approach is taken is the QMP
+command "x-query-registers". This returns a formatted dump of the
+architecture specific CPU state. The way the data is formatted varies
+across QEMU targets, is liable to change over time, and is only
+intended to be consumed as an opaque string by machines. Refer to the
+`Writing a debugging aid returning unstructured text`_ section for
+an illustration.
+
+User Defined Types
+~~~~~~~~~~~~~~~~~~
+
+For this example we will write the query-option-roms command, which
+returns information about ROMs loaded into the option ROM space. For
+more information about it, please check the "-option-rom" command-line
+option.
+
+For each option ROM, we want to return two pieces of information: the
+ROM image's file name, and its bootindex, if any. We need to create a
+new QAPI type for that, as shown below::
+
+ ##
+ # @OptionRomInfo:
+ #
+ # @filename: option ROM image file name
+ #
+ # @bootindex: option ROM's bootindex
+ #
+ # Since: 9.0
+ ##
+ { 'struct': 'OptionRomInfo',
+ 'data': { 'filename': 'str', '*bootindex': 'int' } }
+
+The "struct" keyword defines a new QAPI type. Its "data" member
+contains the type's members. In this example our members are
+"filename" and "bootindex". The latter is optional.
+
+Now let's define the query-option-roms command::
+
+ ##
+ # @query-option-roms:
+ #
+ # Query information on ROMs loaded into the option ROM space.
+ #
+ # Returns: OptionRomInfo
+ #
+ # Since: 9.0
+ ##
+ { 'command': 'query-option-roms',
+ 'returns': ['OptionRomInfo'] }
+
+Notice the "returns" keyword. As its name suggests, it's used to define the
+data returned by a command.
+
+Notice the syntax "['OptionRomInfo']". This should be read as "returns
+a list of OptionRomInfo".
+
+It's time to implement the qmp_query_option_roms() function. Add it to
+monitor/qmp-cmds.c::
+
+ OptionRomInfoList *qmp_query_option_roms(Error **errp)
+ {
+ OptionRomInfoList *info_list = NULL;
+ OptionRomInfoList **tailp = &info_list;
+ OptionRomInfo *info;
+
+ for (int i = 0; i < nb_option_roms; i++) {
+ info = g_malloc0(sizeof(*info));
+ info->filename = g_strdup(option_rom[i].name);
+ info->has_bootindex = option_rom[i].bootindex >= 0;
+ if (info->has_bootindex) {
+ info->bootindex = option_rom[i].bootindex;
+ }
+ QAPI_LIST_APPEND(tailp, info);
+ }
+
+ return info_list;
+ }
+
+There are a number of things to be noticed:
+
+1. Type OptionRomInfo is automatically generated by the QAPI framework,
+ its members correspond to the type's specification in the schema
+ file
+2. Type OptionRomInfoList is also generated. It's a singly linked
+ list.
+3. As specified in the schema file, the function returns an
+ OptionRomInfoList, and takes no arguments (besides the "errp" one,
+ which is mandatory for all QMP functions)
+4. The returned object is dynamically allocated
+5. All strings are dynamically allocated. This is so because QAPI also
+ generates a function to free its types and it cannot distinguish
+ between dynamically or statically allocated strings
+6. Remember that "bootindex" is optional? As a non-pointer optional
+ member, it comes with a 'has_bootindex' member that needs to be set
+ by the implementation, as shown above
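Annotation: the tail-pointer idiom used to build the returned list can be sketched in plain C. All names below (`SketchRomInfo`, `SketchRomInfoList`, `sketch_list_append()` and so on) are hypothetical, and the standard allocator stands in for GLib's; this mirrors the shape of the QAPI-generated list type and the append step, but it is not the generated code or the QAPI_LIST_APPEND() macro itself.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical element and list types shaped like QAPI-generated ones. */
typedef struct SketchRomInfo {
    char *filename;
} SketchRomInfo;

typedef struct SketchRomInfoList {
    struct SketchRomInfoList *next;
    SketchRomInfo *value;
} SketchRomInfoList;

/* Heap-allocated copy of a string, so that it can be freed uniformly. */
static char *sketch_strdup(const char *s)
{
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);

    if (!copy) {
        abort();
    }
    memcpy(copy, s, len);
    return copy;
}

/* Link a new node holding 'value' at *tailp; return the new tail slot. */
SketchRomInfoList **sketch_list_append(SketchRomInfoList **tailp,
                                       SketchRomInfo *value)
{
    SketchRomInfoList *node = calloc(1, sizeof(*node));

    if (!node) {
        abort();
    }
    node->value = value;
    *tailp = node;              /* fills either the head or the last 'next' */
    return &node->next;
}

/* Build a list in input order without ever walking it. */
SketchRomInfoList *sketch_build_list(const char *const *names, size_t n)
{
    SketchRomInfoList *head = NULL;
    SketchRomInfoList **tailp = &head;

    for (size_t i = 0; i < n; i++) {
        SketchRomInfo *info = calloc(1, sizeof(*info));

        if (!info) {
            abort();
        }
        info->filename = sketch_strdup(names[i]);
        tailp = sketch_list_append(tailp, info);
    }
    return head;
}

/* Free every node, its value and the value's strings. */
void sketch_free_list(SketchRomInfoList *list)
{
    while (list) {
        SketchRomInfoList *next = list->next;

        free(list->value->filename);
        free(list->value);
        free(list);
        list = next;
    }
}
```

Because `tailp` always points at the slot where the next node belongs, each append is constant time and the elements keep their insertion order. Copying every string onto the heap is what lets one free routine release the whole structure.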
+
+Time to test the new command. Build QEMU, run it as described in the "Testing"
+section and try this::
+
+ { "execute": "query-option-roms" }
+ {
+ "return": [
+ {
+ "filename": "kvmvapic.bin"
+ }
+ ]
+ }
+
+
+The HMP command
+~~~~~~~~~~~~~~~
+
+Here's the HMP counterpart of the query-option-roms command::
+
+ void hmp_info_option_roms(Monitor *mon, const QDict *qdict)
+ {
+ Error *err = NULL;
+ OptionRomInfoList *info_list, *tail;
+ OptionRomInfo *info;
+
+ info_list = qmp_query_option_roms(&err);
+ if (hmp_handle_error(mon, err)) {
+ return;
+ }
+
+ for (tail = info_list; tail; tail = tail->next) {
+ info = tail->value;
+ monitor_printf(mon, "%s", info->filename);
+ if (info->has_bootindex) {
+ monitor_printf(mon, " %" PRId64, info->bootindex);
+ }
+ monitor_printf(mon, "\n");
+ }
+
+ qapi_free_OptionRomInfoList(info_list);
+ }
+
+It's important to notice that hmp_info_option_roms() calls
+qapi_free_OptionRomInfoList() to free the data returned by
+qmp_query_option_roms(). For user defined types, QAPI will generate a
+qapi_free_QAPI_TYPE_NAME() function to free the types you define, and a
+qapi_free_QAPI_TYPE_NAMEList() function for the corresponding list
+types. If the QMP function returns a
+string, then you should use g_free() to free it.
+
+Also note that hmp_info_option_roms() performs error handling. That's
+not strictly required when you're sure the QMP function doesn't return
+errors; you could instead pass it &error_abort then.
+
+Another important detail is that HMP's "info" commands go into
+hmp-commands-info.hx, not hmp-commands.hx. The entry for the "info
+option-roms" follows::
+
+ {
+ .name = "option-roms",
+ .args_type = "",
+ .params = "",
+ .help = "show roms",
+ .cmd = hmp_info_option_roms,
+ },
+ SRST
+ ``info option-roms``
+ Show the option ROMs.
+ ERST
+
+To test this, run QEMU and type "info option-roms" in the user monitor.
+
+
+Writing a debugging aid returning unstructured text
+---------------------------------------------------
+
+As discussed in section `Modelling data in QAPI`_, it is required that
+commands expecting machine usage be using fine-grained QAPI data types.
+The exception to this rule applies when the command is solely intended
+as a debugging aid and allows for returning unstructured text, such as
+a query command that reports aspects of QEMU's internal state that are
+useful only to human operators.
+
+In this example we will consider the existing QMP command
+``x-query-roms`` in qapi/machine.json. It has no parameters and
+returns a ``HumanReadableText``::
+
+ ##
+ # @x-query-roms:
+ #
+ # Query information on the registered ROMS
+ #
+ # Features:
+ #
+ # @unstable: This command is meant for debugging.
+ #
+ # Returns: registered ROMs
+ #
+ # Since: 6.2
+ ##
+ { 'command': 'x-query-roms',
+ 'returns': 'HumanReadableText',
+ 'features': [ 'unstable' ] }
+
+The ``HumanReadableText`` struct is defined in qapi/common.json as a
+struct with a string member. It is intended to be used for all
+commands that are returning unstructured text targeted at
+humans. These should all have feature 'unstable'. Note that the
+feature's documentation states why the command is unstable. We
+commonly use an ``x-`` command name prefix to make lack of stability
+obvious to human users.
+
+Implementing the QMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The QMP implementation will typically involve creating a ``GString``
+object and printing formatted data into it, like this::
+
+ HumanReadableText *qmp_x_query_roms(Error **errp)
+ {
+ g_autoptr(GString) buf = g_string_new("");
+ Rom *rom;
+
+ QTAILQ_FOREACH(rom, &roms, next) {
+ g_string_append_printf(buf, "%s size=0x%06zx name=\"%s\"\n",
+ memory_region_name(rom->mr),
+ rom->romsize,
+ rom->name);
+ }
+
+ return human_readable_text_from_str(buf);
+ }
+
+The actual implementation emits more information. You can find it in
+hw/core/loader.c.
+
+
+Implementing the HMP command
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now that the QMP command is in place, we can also make it available in
+the human monitor (HMP) as shown in previous examples. The HMP
+implementations will all look fairly similar, as all they need do is
+invoke the QMP command and then print the resulting text or error
+message. Here's an implementation of the "info roms" HMP command::
+
+ void hmp_info_roms(Monitor *mon, const QDict *qdict)
+ {
+ Error *err = NULL;
+ g_autoptr(HumanReadableText) info = qmp_x_query_roms(&err);
+
+ if (hmp_handle_error(mon, err)) {
+ return;
+ }
+ monitor_puts(mon, info->human_readable_text);
+ }
+
+Also, you have to add the function's prototype to the hmp.h file.
+
+There's one last step to actually make the command available to
+monitor users: we should add it to the hmp-commands-info.hx file::
+
+ {
+ .name = "roms",
+ .args_type = "",
+ .params = "",
+ .help = "show roms",
+ .cmd = hmp_info_roms,
+ },
+
+The case of writing a HMP info handler that calls a no-parameter QMP query
+command is quite common. To simplify the implementation there is a general
+purpose HMP info handler for this scenario. All that is required to expose
+a no-parameter QMP query command via HMP is to declare it using the
+'.cmd_info_hrt' field to point to the QMP handler, and leave the '.cmd'
+field NULL::
+
+ {
+ .name = "roms",
+ .args_type = "",
+ .params = "",
+ .help = "show roms",
+ .cmd_info_hrt = qmp_x_query_roms,
+ },
+
+This is how the actual "info roms" HMP command is implemented.
diff --git a/docs/devel/writing-qmp-commands.rst b/docs/devel/writing-qmp-commands.rst
deleted file mode 100644
index 6a10a06c48..0000000000
--- a/docs/devel/writing-qmp-commands.rst
+++ /dev/null
@@ -1,622 +0,0 @@
-How to write QMP commands using the QAPI framework
-==================================================
-
-This document is a step-by-step guide on how to write new QMP commands using
-the QAPI framework. It also shows how to implement new style HMP commands.
-
-This document doesn't discuss QMP protocol level details, nor does it dive
-into the QAPI framework implementation.
-
-For an in-depth introduction to the QAPI framework, please refer to
-docs/devel/qapi-code-gen.txt. For documentation about the QMP protocol,
-start with docs/interop/qmp-intro.txt.
-
-
-Overview
---------
-
-Generally speaking, the following steps should be taken in order to write a
-new QMP command.
-
-1. Define the command and any types it needs in the appropriate QAPI
- schema module.
-
-2. Write the QMP command itself, which is a regular C function. Preferably,
- the command should be exported by some QEMU subsystem. But it can also be
- added to the monitor/qmp-cmds.c file
-
-3. At this point the command can be tested under the QMP protocol
-
-4. Write the HMP command equivalent. This is not required and should only be
- done if it does make sense to have the functionality in HMP. The HMP command
- is implemented in terms of the QMP command
-
-The following sections will demonstrate each of the steps above. We will start
-very simple and get more complex as we progress.
-
-
-Testing
--------
-
-For all the examples in the next sections, the test setup is the same and is
-shown here.
-
-First, QEMU should be started like this::
-
- # qemu-system-TARGET [...] \
- -chardev socket,id=qmp,port=4444,host=localhost,server=on \
- -mon chardev=qmp,mode=control,pretty=on
-
-Then, in a different terminal::
-
- $ telnet localhost 4444
- Trying 127.0.0.1...
- Connected to localhost.
- Escape character is '^]'.
- {
- "QMP": {
- "version": {
- "qemu": {
- "micro": 50,
- "minor": 15,
- "major": 0
- },
- "package": ""
- },
- "capabilities": [
- ]
- }
- }
-
-The above output is the QMP server saying you're connected. The server is
-actually in capabilities negotiation mode. To enter in command mode type::
-
- { "execute": "qmp_capabilities" }
-
-Then the server should respond::
-
- {
- "return": {
- }
- }
-
-Which is QMP's way of saying "the latest command executed OK and didn't return
-any data". Now you're ready to enter the QMP example commands as explained in
-the following sections.
-
-
-Writing a command that doesn't return data
-------------------------------------------
-
-That's the most simple QMP command that can be written. Usually, this kind of
-command carries some meaningful action in QEMU but here it will just print
-"Hello, world" to the standard output.
-
-Our command will be called "hello-world". It takes no arguments, nor does it
-return any data.
-
-The first step is defining the command in the appropriate QAPI schema
-module. We pick module qapi/misc.json, and add the following line at
-the bottom::
-
- { 'command': 'hello-world' }
-
-The "command" keyword defines a new QMP command. It's an JSON object. All
-schema entries are JSON objects. The line above will instruct the QAPI to
-generate any prototypes and the necessary code to marshal and unmarshal
-protocol data.
-
-The next step is to write the "hello-world" implementation. As explained
-earlier, it's preferable for commands to live in QEMU subsystems. But
-"hello-world" doesn't pertain to any, so we put its implementation in
-monitor/qmp-cmds.c::
-
- void qmp_hello_world(Error **errp)
- {
- printf("Hello, world!\n");
- }
-
-There are a few things to be noticed:
-
-1. QMP command implementation functions must be prefixed with "qmp\_"
-2. qmp_hello_world() returns void, this is in accordance with the fact that the
- command doesn't return any data
-3. It takes an "Error \*\*" argument. This is required. Later we will see how to
- return errors and take additional arguments. The Error argument should not
- be touched if the command doesn't return errors
-4. We won't add the function's prototype. That's automatically done by the QAPI
-5. Printing to the terminal is discouraged for QMP commands, we do it here
- because it's the easiest way to demonstrate a QMP command
-
-You're done. Now build qemu, run it as suggested in the "Testing" section,
-and then type the following QMP command::
-
- { "execute": "hello-world" }
-
-Then check the terminal running qemu and look for the "Hello, world" string. If
-you don't see it then something went wrong.
-
-
-Arguments
-~~~~~~~~~
-
-Let's add an argument called "message" to our "hello-world" command. The new
-argument will contain the string to be printed to stdout. It's an optional
-argument, if it's not present we print our default "Hello, World" string.
-
-The first change we have to do is to modify the command specification in the
-schema file to the following::
-
- { 'command': 'hello-world', 'data': { '*message': 'str' } }
-
-Notice the new 'data' member in the schema. It's an JSON object whose each
-element is an argument to the command in question. Also notice the asterisk,
-it's used to mark the argument optional (that means that you shouldn't use it
-for mandatory arguments). Finally, 'str' is the argument's type, which
-stands for "string". The QAPI also supports integers, booleans, enumerations
-and user defined types.
-
-Now, let's update our C implementation in monitor/qmp-cmds.c::
-
- void qmp_hello_world(bool has_message, const char *message, Error **errp)
- {
- if (has_message) {
- printf("%s\n", message);
- } else {
- printf("Hello, world\n");
- }
- }
-
-There are two important details to be noticed:
-
-1. All optional arguments are accompanied by a 'has\_' boolean, which is set
- if the optional argument is present or false otherwise
-2. The C implementation signature must follow the schema's argument ordering,
- which is defined by the "data" member
-
-Time to test our new version of the "hello-world" command. Build qemu, run it as
-described in the "Testing" section and then send two commands::
-
- { "execute": "hello-world" }
- {
- "return": {
- }
- }
-
- { "execute": "hello-world", "arguments": { "message": "We love qemu" } }
- {
- "return": {
- }
- }
-
-You should see "Hello, world" and "We love qemu" in the terminal running qemu,
-if you don't see these strings, then something went wrong.
-
-
-Errors
-~~~~~~
-
-QMP commands should use the error interface exported by the error.h header
-file. Basically, most errors are set by calling the error_setg() function.
-
-Let's say we don't accept the string "message" to contain the word "love". If
-it does contain it, we want the "hello-world" command to return an error::
-
- void qmp_hello_world(bool has_message, const char *message, Error **errp)
- {
- if (has_message) {
- if (strstr(message, "love")) {
- error_setg(errp, "the word 'love' is not allowed");
- return;
- }
- printf("%s\n", message);
- } else {
- printf("Hello, world\n");
- }
- }
-
-The first argument to the error_setg() function is the Error pointer
-to pointer, which is passed to all QMP functions. The next argument is a human
-description of the error, this is a free-form printf-like string.
-
-Let's test the example above. Build qemu, run it as defined in the "Testing"
-section, and then issue the following command::
-
- { "execute": "hello-world", "arguments": { "message": "all you need is love" } }
-
-The QMP server's response should be::
-
- {
- "error": {
- "class": "GenericError",
- "desc": "the word 'love' is not allowed"
- }
- }
-
-Note that error_setg() produces a "GenericError" class. In general,
-all QMP errors should have that error class. There are two exceptions
-to this rule:
-
- 1. To support a management application's need to recognize a specific
- error for special handling
-
- 2. Backward compatibility
-
-If the failure you want to report falls into one of the two cases above,
-use error_set() with a second argument of an ErrorClass value.
-
-
-Command Documentation
-~~~~~~~~~~~~~~~~~~~~~
-
-There's only one step missing to make "hello-world"'s implementation complete,
-and that's its documentation in the schema file.
-
-There are many examples of such documentation in the schema file already, but
-here goes "hello-world"'s new entry for qapi/misc.json::
-
- ##
- # @hello-world:
- #
- # Print a client provided string to the standard output stream.
- #
- # @message: string to be printed
- #
- # Returns: Nothing on success.
- #
- # Notes: if @message is not provided, the "Hello, world" string will
- # be printed instead
- #
- # Since: <next qemu stable release, eg. 1.0>
- ##
- { 'command': 'hello-world', 'data': { '*message': 'str' } }
-
-Please, note that the "Returns" clause is optional if a command doesn't return
-any data nor any errors.
-
-
-Implementing the HMP command
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Now that the QMP command is in place, we can also make it available in the human
-monitor (HMP).
-
-With the introduction of the QAPI, HMP commands make QMP calls. Most of the
-time HMP commands are simple wrappers. All HMP commands implementation exist in
-the monitor/hmp-cmds.c file.
-
-Here's the implementation of the "hello-world" HMP command::
-
- void hmp_hello_world(Monitor *mon, const QDict *qdict)
- {
- const char *message = qdict_get_try_str(qdict, "message");
- Error *err = NULL;
-
- qmp_hello_world(!!message, message, &err);
- if (err) {
- monitor_printf(mon, "%s\n", error_get_pretty(err));
- error_free(err);
- return;
- }
- }
-
-Also, you have to add the function's prototype to the hmp.h file.
-
-There are three important points to be noticed:
-
-1. The "mon" and "qdict" arguments are mandatory for all HMP functions. The
- former is the monitor object. The latter is how the monitor passes
- arguments entered by the user to the command implementation
-2. hmp_hello_world() performs error checking. In this example we just print
- the error description to the user, but we could do more, like taking
- different actions depending on the error qmp_hello_world() returns
-3. The "err" variable must be initialized to NULL before performing the
- QMP call
-
-There's one last step to actually make the command available to monitor users,
-we should add it to the hmp-commands.hx file::
-
- {
- .name = "hello-world",
- .args_type = "message:s?",
- .params = "hello-world [message]",
- .help = "Print message to the standard output",
- .cmd = hmp_hello_world,
- },
-
-::
-
- STEXI
- @item hello_world @var{message}
- @findex hello_world
- Print message to the standard output
- ETEXI
-
-To test this you have to open a user monitor and issue the "hello-world"
-command. It might be instructive to check the command's documentation with
-HMP's "help" command.
-
-Please, check the "-monitor" command-line option to know how to open a user
-monitor.
-
-
-Writing a command that returns data
------------------------------------
-
-A QMP command is capable of returning any data the QAPI supports like integers,
-strings, booleans, enumerations and user defined types.
-
-In this section we will focus on user defined types. Please, check the QAPI
-documentation for information about the other types.
-
-
-User Defined Types
-~~~~~~~~~~~~~~~~~~
-
-FIXME This example needs to be redone after commit 6d32717
-
-For this example we will write the query-alarm-clock command, which returns
-information about QEMU's timer alarm. For more information about it, please
-check the "-clock" command-line option.
-
-We want to return two pieces of information. The first one is the alarm clock's
-name. The second one is when the next alarm will fire. The former information is
-returned as a string, the latter is an integer in nanoseconds (which is not
-very useful in practice, as the timer has probably already fired when the
-information reaches the client).
-
-The best way to return that data is to create a new QAPI type, as shown below::
-
- ##
- # @QemuAlarmClock
- #
- # QEMU alarm clock information.
- #
- # @clock-name: The alarm clock method's name.
- #
- # @next-deadline: The time (in nanoseconds) the next alarm will fire.
- #
- # Since: 1.0
- ##
- { 'type': 'QemuAlarmClock',
- 'data': { 'clock-name': 'str', '*next-deadline': 'int' } }
-
-The "type" keyword defines a new QAPI type. Its "data" member contains the
-type's members. In this example our members are the "clock-name" and the
-"next-deadline" one, which is optional.
-
-Now let's define the query-alarm-clock command::
-
- ##
- # @query-alarm-clock
- #
- # Return information about QEMU's alarm clock.
- #
- # Returns a @QemuAlarmClock instance describing the alarm clock method
- # being currently used by QEMU (this is usually set by the '-clock'
- # command-line option).
- #
- # Since: 1.0
- ##
- { 'command': 'query-alarm-clock', 'returns': 'QemuAlarmClock' }
-
-Notice the "returns" keyword. As its name suggests, it's used to define the
-data returned by a command.
-
-It's time to implement the qmp_query_alarm_clock() function, you can put it
-in the qemu-timer.c file::
-
- QemuAlarmClock *qmp_query_alarm_clock(Error **errp)
- {
- QemuAlarmClock *clock;
- int64_t deadline;
-
- clock = g_malloc0(sizeof(*clock));
-
- deadline = qemu_next_alarm_deadline();
- if (deadline > 0) {
- clock->has_next_deadline = true;
- clock->next_deadline = deadline;
- }
- clock->clock_name = g_strdup(alarm_timer->name);
-
- return clock;
- }
-
-There are a number of things to be noticed:
-
-1. The QemuAlarmClock type is automatically generated by the QAPI framework,
- its members correspond to the type's specification in the schema file
-2. As specified in the schema file, the function returns a QemuAlarmClock
- instance and takes no arguments (besides the "errp" one, which is mandatory
- for all QMP functions)
-3. The "clock" variable (which will point to our QAPI type instance) is
- allocated by the regular g_malloc0() function. Note that we chose to
- initialize the memory to zero. This is recommended for all QAPI types, as
- it helps avoiding bad surprises (specially with booleans)
-4. Remember that "next_deadline" is optional? All optional members have a
- 'has_TYPE_NAME' member that should be properly set by the implementation,
- as shown above
-5. Even static strings, such as "alarm_timer->name", should be dynamically
- allocated by the implementation. This is so because the QAPI also generates
- a function to free its types and it cannot distinguish between dynamically
- or statically allocated strings
-6. You have to include "qapi/qapi-commands-misc.h" in qemu-timer.c
-
-Time to test the new command. Build qemu, run it as described in the "Testing"
-section and try this::
-
- { "execute": "query-alarm-clock" }
- {
- "return": {
- "next-deadline": 2368219,
- "clock-name": "dynticks"
- }
- }
-
-
-The HMP command
-~~~~~~~~~~~~~~~
-
-Here's the HMP counterpart of the query-alarm-clock command::
-
- void hmp_info_alarm_clock(Monitor *mon)
- {
- QemuAlarmClock *clock;
- Error *err = NULL;
-
- clock = qmp_query_alarm_clock(&err);
- if (err) {
- monitor_printf(mon, "Could not query alarm clock information\n");
- error_free(err);
- return;
- }
-
- monitor_printf(mon, "Alarm clock method in use: '%s'\n", clock->clock_name);
- if (clock->has_next_deadline) {
- monitor_printf(mon, "Next alarm will fire in %" PRId64 " nanoseconds\n",
- clock->next_deadline);
- }
-
- qapi_free_QemuAlarmClock(clock);
- }
-
-It's important to notice that hmp_info_alarm_clock() calls
-qapi_free_QemuAlarmClock() to free the data returned by qmp_query_alarm_clock().
-For user defined types, the QAPI will generate a qapi_free_QAPI_TYPE_NAME()
-function and that's what you have to use to free the types you define and
-qapi_free_QAPI_TYPE_NAMEList() for list types (explained in the next section).
-If the QMP call returns a string, then you should g_free() to free it.
-
-Also note that hmp_info_alarm_clock() performs error handling. That's not
-strictly required if you're sure the QMP function doesn't return errors, but
-it's good practice to always check for errors.
-
-Another important detail is that HMP's "info" commands don't go into the
-hmp-commands.hx. Instead, they go into the info_cmds[] table, which is defined
-in the monitor/misc.c file. The entry for the "info alarmclock" follows::
-
- {
- .name = "alarmclock",
- .args_type = "",
- .params = "",
- .help = "show information about the alarm clock",
- .cmd = hmp_info_alarm_clock,
- },
-
-To test this, run qemu and type "info alarmclock" in the user monitor.
-
-
-Returning Lists
-~~~~~~~~~~~~~~~
-
-For this example, we're going to return all available methods for the timer
-alarm, which is pretty much what the command-line option "-clock ?" does,
-except that we're also going to inform which method is in use.
-
-This first step is to define a new type::
-
- ##
- # @TimerAlarmMethod
- #
- # Timer alarm method information.
- #
- # @method-name: The method's name.
- #
- # @current: true if this alarm method is currently in use, false otherwise
- #
- # Since: 1.0
- ##
- { 'type': 'TimerAlarmMethod',
- 'data': { 'method-name': 'str', 'current': 'bool' } }
-
-The command will be called "query-alarm-methods", here is its schema
-specification::
-
- ##
- # @query-alarm-methods
- #
- # Returns information about available alarm methods.
- #
- # Returns: a list of @TimerAlarmMethod for each method
- #
- # Since: 1.0
- ##
- { 'command': 'query-alarm-methods', 'returns': ['TimerAlarmMethod'] }
-
-Notice the syntax for returning lists "'returns': ['TimerAlarmMethod']", this
-should be read as "returns a list of TimerAlarmMethod instances".
-
-The C implementation follows::
-
- TimerAlarmMethodList *qmp_query_alarm_methods(Error **errp)
- {
- TimerAlarmMethodList *method_list = NULL;
- const struct qemu_alarm_timer *p;
- bool current = true;
-
- for (p = alarm_timers; p->name; p++) {
- TimerAlarmMethod *value = g_malloc0(*value);
- value->method_name = g_strdup(p->name);
- value->current = current;
- QAPI_LIST_PREPEND(method_list, value);
- current = false;
- }
-
- return method_list;
- }
-
-The most important difference from the previous examples is the
-TimerAlarmMethodList type, which is automatically generated by the QAPI from
-the TimerAlarmMethod type.
-
-Each list node is represented by a TimerAlarmMethodList instance. We have to
-allocate it, and that's done inside the for loop: the "info" pointer points to
-an allocated node. We also have to allocate the node's contents, which is
-stored in its "value" member. In our example, the "value" member is a pointer
-to an TimerAlarmMethod instance.
-
-Notice that the "current" variable is used as "true" only in the first
-iteration of the loop. That's because the alarm timer method in use is the
-first element of the alarm_timers array. Also notice that QAPI lists are handled
-by hand and we return the head of the list.
-
-Now Build qemu, run it as explained in the "Testing" section and try our new
-command::
-
- { "execute": "query-alarm-methods" }
- {
- "return": [
- {
- "current": false,
- "method-name": "unix"
- },
- {
- "current": true,
- "method-name": "dynticks"
- }
- ]
- }
-
-The HMP counterpart is a bit more complex than previous examples because it
-has to traverse the list, it's shown below for reference::
-
- void hmp_info_alarm_methods(Monitor *mon)
- {
- TimerAlarmMethodList *method_list, *method;
- Error *err = NULL;
-
- method_list = qmp_query_alarm_methods(&err);
- if (err) {
- monitor_printf(mon, "Could not query alarm methods\n");
- error_free(err);
- return;
- }
-
- for (method = method_list; method; method = method->next) {
- monitor_printf(mon, "%c %s\n", method->value->current ? '*' : ' ',
- method->value->method_name);
- }
-
- qapi_free_TimerAlarmMethodList(method_list);
- }
diff --git a/docs/devel/zoned-storage.rst b/docs/devel/zoned-storage.rst
new file mode 100644
index 0000000000..30296d3c85
--- /dev/null
+++ b/docs/devel/zoned-storage.rst
@@ -0,0 +1,62 @@
+=============
+zoned-storage
+=============
+
+Zoned Block Devices (ZBDs) divide the LBA space into block regions called zones
+that are larger than the LBA size. They can only allow sequential writes, which
+can reduce write amplification in SSDs, and potentially lead to higher
+throughput and increased capacity. More details about ZBDs can be found at:
+
+https://zonedstorage.io/docs/introduction/zoned-storage
+
+1. Block layer APIs for zoned storage
+-------------------------------------
+The QEMU block layer supports three zoned storage models:
+- BLK_Z_HM: The host-managed zoned model only allows sequential write access
+to zones. It supports ZBD-specific I/O commands that can be used by a host to
+manage the zones of a device.
+- BLK_Z_HA: The host-aware zoned model allows random write operations in
+zones, making it backward compatible with regular block devices.
+- BLK_Z_NONE: The non-zoned model has no zones support. It includes both
+regular and drive-managed ZBD devices. ZBD-specific I/O commands are not
+supported.
+
+The block device information resides inside BlockDriverState. QEMU uses
+BlockLimits struct (BlockDriverState::bl) that is continuously accessed by the
+block layer while processing I/O requests. A BlockBackend has a root pointer to
+a BlockDriverState graph (for example, raw format on top of file-posix). The
+zoned storage information can be propagated from the leaf BlockDriverState all
+the way up to the BlockBackend. If the zoned storage model in file-posix is
+set to BLK_Z_HM, then block drivers will declare support for zoned host device.
+
+The block layer APIs support commands needed for zoned storage devices,
+including report zones, four zone operations, and zone append.
+
+2. Emulating zoned storage controllers
+--------------------------------------
+When the BlockBackend's BlockLimits model reports a zoned storage device, users
+like the virtio-blk emulation or the qemu-io-cmds.c utility can use block layer
+APIs for zoned storage emulation or testing.
+
+For example, to test zone_report on a null_blk device using qemu-io::
+
+ $ path/to/qemu-io --image-opts -n driver=host_device,filename=/dev/nullb0 -c "zrp offset nr_zones"
+
+To expose the host's zoned block device through virtio-blk, the command line
+can be (includes the -device parameter)::
+
+ -blockdev node-name=drive0,driver=host_device,filename=/dev/nullb0,cache.direct=on \
+ -device virtio-blk-pci,drive=drive0
+
+Or only use the -drive parameter::
+
+ -drive driver=host_device,file=/dev/nullb0,if=virtio,cache.direct=on
+
+Additionally, QEMU has several ways of supporting zoned storage, including:
+(1) Using virtio-scsi: --device scsi-block allows for the passing through of
+SCSI ZBC devices, enabling the attachment of ZBC or ZAC HDDs to QEMU.
+(2) PCI device pass-through: While NVMe ZNS emulation is available for testing
+purposes, it cannot yet pass through a zoned device from the host. To pass the
+NVMe ZNS device to the guest, use VFIO PCI to pass the entire NVMe PCI adapter
+through to the guest. Likewise, an HDD HBA can be passed through to QEMU along
+with all HDDs attached to it.
diff --git a/docs/hyperv.txt b/docs/hyperv.txt
deleted file mode 100644
index 000638a2fd..0000000000
--- a/docs/hyperv.txt
+++ /dev/null
@@ -1,222 +0,0 @@
-Hyper-V Enlightenments
-======================
-
-
-1. Description
-===============
-In some cases when implementing a hardware interface in software is slow, KVM
-implements its own paravirtualized interfaces. This works well for Linux as
-guest support for such features is added simultaneously with the feature itself.
-It may, however, be hard-to-impossible to add support for these interfaces to
-proprietary OSes, namely, Microsoft Windows.
-
-KVM on x86 implements Hyper-V Enlightenments for Windows guests. These features
-make Windows and Hyper-V guests think they're running on top of a Hyper-V
-compatible hypervisor and use Hyper-V specific features.
-
-
-2. Setup
-=========
-No Hyper-V enlightenments are enabled by default by either KVM or QEMU. In
-QEMU, individual enlightenments can be enabled through CPU flags, e.g:
-
- qemu-system-x86_64 --enable-kvm --cpu host,hv_relaxed,hv_vpindex,hv_time, ...
-
-Sometimes there are dependencies between enlightenments, QEMU is supposed to
-check that the supplied configuration is sane.
-
-When any set of the Hyper-V enlightenments is enabled, QEMU changes hypervisor
-identification (CPUID 0x40000000..0x4000000A) to Hyper-V. KVM identification
-and features are kept in leaves 0x40000100..0x40000101.
-
-
-3. Existing enlightenments
-===========================
-
-3.1. hv-relaxed
-================
-This feature tells guest OS to disable watchdog timeouts as it is running on a
-hypervisor. It is known that some Windows versions will do this even when they
-see 'hypervisor' CPU flag.
-
-3.2. hv-vapic
-==============
-Provides so-called VP Assist page MSR to guest allowing it to work with APIC
-more efficiently. In particular, this enlightenment allows paravirtualized
-(exit-less) EOI processing.
-
-3.3. hv-spinlocks=xxx
-======================
-Enables paravirtualized spinlocks. The parameter indicates how many times
-spinlock acquisition should be attempted before indicating the situation to the
-hypervisor. A special value 0xffffffff indicates "never notify".
-
-3.4. hv-vpindex
-================
-Provides HV_X64_MSR_VP_INDEX (0x40000002) MSR to the guest which has Virtual
-processor index information. This enlightenment makes sense in conjunction with
-hv-synic, hv-stimer and other enlightenments which require the guest to know its
-Virtual Processor indices (e.g. when VP index needs to be passed in a
-hypercall).
-
-3.5. hv-runtime
-================
-Provides HV_X64_MSR_VP_RUNTIME (0x40000010) MSR to the guest. The MSR keeps the
-virtual processor run time in 100ns units. This gives guest operating system an
-idea of how much time was 'stolen' from it (when the virtual CPU was preempted
-to perform some other work).
-
-3.6. hv-crash
-==============
-Provides HV_X64_MSR_CRASH_P0..HV_X64_MSR_CRASH_P5 (0x40000100..0x40000105) and
-HV_X64_MSR_CRASH_CTL (0x40000105) MSRs to the guest. These MSRs are written to
-by the guest when it crashes, HV_X64_MSR_CRASH_P0..HV_X64_MSR_CRASH_P5 MSRs
-contain additional crash information. This information is outputted in QEMU log
-and through QAPI.
-Note: unlike under genuine Hyper-V, write to HV_X64_MSR_CRASH_CTL causes guest
-to shutdown. This effectively blocks crash dump generation by Windows.
-
-3.7. hv-time
-=============
-Enables two Hyper-V-specific clocksources available to the guest: MSR-based
-Hyper-V clocksource (HV_X64_MSR_TIME_REF_COUNT, 0x40000020) and Reference TSC
-page (enabled via MSR HV_X64_MSR_REFERENCE_TSC, 0x40000021). Both clocksources
-are per-guest, Reference TSC page clocksource allows for exit-less time stamp
-readings. Using this enlightenment leads to significant speedup of all timestamp
-related operations.
-
-3.8. hv-synic
-==============
-Enables Hyper-V Synthetic interrupt controller - an extension of a local APIC.
-When enabled, this enlightenment provides additional communication facilities
-to the guest: SynIC messages and Events. This is a pre-requisite for
-implementing VMBus devices (not yet in QEMU). Additionally, this enlightenment
-is needed to enable Hyper-V synthetic timers. SynIC is controlled through MSRs
-HV_X64_MSR_SCONTROL..HV_X64_MSR_EOM (0x40000080..0x40000084) and
-HV_X64_MSR_SINT0..HV_X64_MSR_SINT15 (0x40000090..0x4000009F)
-
-Requires: hv-vpindex
-
-3.9. hv-stimer
-===============
-Enables Hyper-V synthetic timers. There are four synthetic timers per virtual
-CPU controlled through HV_X64_MSR_STIMER0_CONFIG..HV_X64_MSR_STIMER3_COUNT
-(0x400000B0..0x400000B7) MSRs. These timers can work either in single-shot or
-periodic mode. It is known that certain Windows versions revert to using HPET
-(or even RTC when HPET is unavailable) extensively when this enlightenment is
-not provided; this can lead to significant CPU consumption, even when virtual
-CPU is idle.
-
-Requires: hv-vpindex, hv-synic, hv-time
-
-3.10. hv-tlbflush
-==================
-Enables paravirtualized TLB shoot-down mechanism. On x86 architecture, remote
-TLB flush procedure requires sending IPIs and waiting for other CPUs to perform
-local TLB flush. In virtualized environment some virtual CPUs may not even be
-scheduled at the time of the call and may not require flushing (or, flushing
-may be postponed until the virtual CPU is scheduled). hv-tlbflush enlightenment
-implements TLB shoot-down through hypervisor enabling the optimization.
-
-Requires: hv-vpindex
-
-3.11. hv-ipi
-=============
-Enables paravirtualized IPI send mechanism. HvCallSendSyntheticClusterIpi
-hypercall may target more than 64 virtual CPUs simultaneously, doing the same
-through APIC requires more than one access (and thus exit to the hypervisor).
-
-Requires: hv-vpindex
-
-3.12. hv-vendor-id=xxx
-=======================
-This changes Hyper-V identification in CPUID 0x40000000.EBX-EDX from the default
-"Microsoft Hv". The parameter should be no longer than 12 characters. According
-to the specification, guests shouldn't use this information and it is unknown
-if there is a Windows version which acts differently.
-Note: hv-vendor-id is not an enlightenment and thus doesn't enable Hyper-V
-identification when specified without some other enlightenment.
-
-3.13. hv-reset
-===============
-Provides HV_X64_MSR_RESET (0x40000003) MSR to the guest allowing it to reset
-itself by writing to it. Even when this MSR is enabled, it is not a recommended
-way for Windows to perform system reboot and thus it may not be used.
-
-3.14. hv-frequencies
-============================================
-Provides HV_X64_MSR_TSC_FREQUENCY (0x40000022) and HV_X64_MSR_APIC_FREQUENCY
-(0x40000023) allowing the guest to get its TSC/APIC frequencies without doing
-measurements.
-
-3.15 hv-reenlightenment
-========================
-The enlightenment is nested specific, it targets Hyper-V on KVM guests. When
-enabled, it provides HV_X64_MSR_REENLIGHTENMENT_CONTROL (0x40000106),
-HV_X64_MSR_TSC_EMULATION_CONTROL (0x40000107)and HV_X64_MSR_TSC_EMULATION_STATUS
-(0x40000108) MSRs allowing the guest to get notified when TSC frequency changes
-(only happens on migration) and keep using old frequency (through emulation in
-the hypervisor) until it is ready to switch to the new one. This, in conjunction
-with hv-frequencies, allows Hyper-V on KVM to pass stable clocksource (Reference
-TSC page) to its own guests.
-
-Note, KVM doesn't fully support re-enlightenment notifications and doesn't
-emulate TSC accesses after migration so 'tsc-frequency=' CPU option also has to
-be specified to make migration succeed. The destination host has to either have
-the same TSC frequency or support TSC scaling CPU feature.
-
-Recommended: hv-frequencies
-
-3.16. hv-evmcs
-===============
-The enlightenment is nested specific, it targets Hyper-V on KVM guests. When
-enabled, it provides Enlightened VMCS version 1 feature to the guest. The feature
-implements paravirtualized protocol between L0 (KVM) and L1 (Hyper-V)
-hypervisors making L2 exits to the hypervisor faster. The feature is Intel-only.
-Note: some virtualization features (e.g. Posted Interrupts) are disabled when
-hv-evmcs is enabled. It may make sense to measure your nested workload with and
-without the feature to find out if enabling it is beneficial.
-
-Requires: hv-vapic
-
-3.17. hv-stimer-direct
-=======================
-Hyper-V specification allows synthetic timer operation in two modes: "classic",
-when expiration event is delivered as SynIC message and "direct", when the event
-is delivered via normal interrupt. It is known that nested Hyper-V can only
-use synthetic timers in direct mode and thus 'hv-stimer-direct' needs to be
-enabled.
-
-Requires: hv-vpindex, hv-synic, hv-time, hv-stimer
-
-3.17. hv-no-nonarch-coresharing=on/off/auto
-===========================================
-This enlightenment tells guest OS that virtual processors will never share a
-physical core unless they are reported as sibling SMT threads. This information
-is required by Windows and Hyper-V guests to properly mitigate SMT related CPU
-vulnerabilities.
-When the option is set to 'auto' QEMU will enable the feature only when KVM
-reports that non-architectural coresharing is impossible, this means that
-hyper-threading is not supported or completely disabled on the host. This
-setting also prevents migration as SMT settings on the destination may differ.
-When the option is set to 'on' QEMU will always enable the feature, regardless
-of host setup. To keep guests secure, this can only be used in conjunction with
-exposing correct vCPU topology and vCPU pinning.
-
-4. Development features
-========================
-In some cases (e.g. during development) it may make sense to use QEMU in
-'pass-through' mode and give Windows guests all enlightenments currently
-supported by KVM. This pass-through mode is enabled by "hv-passthrough" CPU
-flag.
-Note: "hv-passthrough" flag only enables enlightenments which are known to QEMU
-(have corresponding "hv-*" flag) and copies "hv-spinlocks="/"hv-vendor-id="
-values from KVM to QEMU. "hv-passthrough" overrides all other "hv-*" settings on
-the command line. Also, enabling this flag effectively prevents migration as the
-list of enabled enlightenments may differ between target and destination hosts.
-
-
-4. Useful links
-================
-Hyper-V Top Level Functional specification and other information:
-https://github.com/MicrosoftDocs/Virtualization-Documentation
diff --git a/docs/image-fuzzer.txt b/docs/image-fuzzer.txt
index 3e23ebec33..279cc8c807 100644
--- a/docs/image-fuzzer.txt
+++ b/docs/image-fuzzer.txt
@@ -51,10 +51,10 @@ assumes that core dumps will be generated in the current working directory.
For comprehensive test results, please, set up your test environment
properly.
-Paths to binaries under test (SUTs) qemu-img and qemu-io are retrieved from
-environment variables. If the environment check fails the runner will
+Paths to binaries under test (SUTs) ``qemu-img`` and ``qemu-io`` are retrieved
+from environment variables. If the environment check fails the runner will
use SUTs installed in system paths.
-qemu-img is required for creation of backing files, so it's mandatory to set
+``qemu-img`` is required for creation of backing files, so it's mandatory to set
the related environment variable if it's not installed in the system path.
For details about environment variables see qemu-iotests/check.
diff --git a/docs/interop/bitmaps.rst b/docs/interop/bitmaps.rst
index 059ad67929..ddf8947d54 100644
--- a/docs/interop/bitmaps.rst
+++ b/docs/interop/bitmaps.rst
@@ -166,9 +166,9 @@ Basic QMP Usage
---------------
The primary interface to manipulating bitmap objects is via the QMP
-interface. If you are not familiar, see docs/interop/qmp-intro.txt for a broad
-overview, and `qemu-qmp-ref <qemu-qmp-ref.html>`_ for a full reference of all
-QMP commands.
+interface. If you are not familiar, see the :doc:`qmp-spec` for the
+protocol, and :doc:`qemu-qmp-ref` for a full reference of all QMP
+commands.
Supported Commands
~~~~~~~~~~~~~~~~~~
@@ -539,12 +539,11 @@ other partial disk images on top of a base image to reconstruct a full backup
from the point in time at which the incremental backup was issued.
The "Push Model" here references the fact that QEMU is "pushing" the modified
-blocks out to a destination. We will be using the `drive-backup
-<qemu-qmp-ref.html#index-drive_002dbackup>`_ and `blockdev-backup
-<qemu-qmp-ref.html#index-blockdev_002dbackup>`_ QMP commands to create both
+blocks out to a destination. We will be using the `blockdev-backup
+<qemu-qmp-ref.html#index-blockdev_002dbackup>`_ QMP command to create both
full and incremental backups.
-Both of these commands are jobs, which have their own QMP API for querying and
+The command is a background job, which has its own QMP API for querying and
management documented in `Background jobs
<qemu-qmp-ref.html#Background-jobs>`_.
@@ -557,6 +556,10 @@ create a new incremental backup chain attached to a drive.
This example creates a new, full backup of "drive0" and accompanies it with a
new, empty bitmap that records writes from this point in time forward.
+The target can be created with the help of `blockdev-add
+<qemu-qmp-ref.html#index-blockdev_002dadd>`_ or `blockdev-create
+<qemu-qmp-ref.html#index-blockdev_002dcreate>`_ commands.
+
.. note:: Any new writes that happen after this command is issued, even while
the backup job runs, will be written locally and not to the backup
destination. These writes will be recorded in the bitmap
@@ -576,12 +579,11 @@ new, empty bitmap that records writes from this point in time forward.
}
},
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive0",
- "target": "/path/to/drive0.full.qcow2",
- "sync": "full",
- "format": "qcow2"
+ "target": "target0",
+ "sync": "full"
}
}
]
@@ -664,12 +666,11 @@ use a transaction to reset the bitmap while making a new full backup:
}
},
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive0",
- "target": "/path/to/drive0.new_full.qcow2",
- "sync": "full",
- "format": "qcow2"
+ "target": "target0",
+ "sync": "full"
}
}
]
@@ -728,19 +729,35 @@ Example: First Incremental Backup
$ qemu-img create -f qcow2 drive0.inc0.qcow2 \
-b drive0.full.qcow2 -F qcow2
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
#. Issue an incremental backup command:
.. code-block:: QMP
-> {
- "execute": "drive-backup",
+ "execute": "blockdev-backup",
"arguments": {
"device": "drive0",
"bitmap": "bitmap0",
- "target": "drive0.inc0.qcow2",
- "format": "qcow2",
- "sync": "incremental",
- "mode": "existing"
+ "target": "target0",
+ "sync": "incremental"
}
}
@@ -785,20 +802,36 @@ Example: Second Incremental Backup
$ qemu-img create -f qcow2 drive0.inc1.qcow2 \
-b drive0.inc0.qcow2 -F qcow2
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc1.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
#. Issue a new incremental backup command. The only difference here is that we
have changed the target image below.
.. code-block:: QMP
-> {
- "execute": "drive-backup",
+ "execute": "blockdev-backup",
"arguments": {
"device": "drive0",
"bitmap": "bitmap0",
- "target": "drive0.inc1.qcow2",
- "format": "qcow2",
- "sync": "incremental",
- "mode": "existing"
+ "target": "target0",
+ "sync": "incremental"
}
}
@@ -866,20 +899,36 @@ image:
file for you, but you lose control over format options like
compatibility and preallocation presets.
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc2.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
#. Issue a new incremental backup command. Apart from the new destination
image, there is no difference from the last two examples.
.. code-block:: QMP
-> {
- "execute": "drive-backup",
+ "execute": "blockdev-backup",
"arguments": {
"device": "drive0",
"bitmap": "bitmap0",
- "target": "drive0.inc2.qcow2",
- "format": "qcow2",
- "sync": "incremental",
- "mode": "existing"
+ "target": "target0",
+ "sync": "incremental"
}
}
@@ -930,6 +979,38 @@ point in time.
$ qemu-img create -f qcow2 drive0.full.qcow2 64G
$ qemu-img create -f qcow2 drive1.full.qcow2 64G
+#. Add target block nodes:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.full.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target1",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive1.full.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
#. Create a full (anchor) backup for each drive, with accompanying bitmaps:
.. code-block:: QMP
@@ -953,21 +1034,19 @@ point in time.
}
},
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive0",
- "target": "/path/to/drive0.full.qcow2",
- "sync": "full",
- "format": "qcow2"
+ "target": "target0",
+ "sync": "full"
}
},
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive1",
- "target": "/path/to/drive1.full.qcow2",
- "sync": "full",
- "format": "qcow2"
+ "target": "target1",
+ "sync": "full"
}
}
]
@@ -1016,6 +1095,38 @@ point in time.
$ qemu-img create -f qcow2 drive1.inc0.qcow2 \
-b drive1.full.qcow2 -F qcow2
+#. Add target block nodes:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target1",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive1.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
#. Issue a multi-drive incremental push backup transaction:
.. code-block:: QMP
@@ -1025,25 +1136,21 @@ point in time.
"arguments": {
"actions": [
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive0",
"bitmap": "bitmap0",
- "format": "qcow2",
- "mode": "existing",
"sync": "incremental",
- "target": "drive0.inc0.qcow2"
+ "target": "target0"
}
},
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive1",
"bitmap": "bitmap0",
- "format": "qcow2",
- "mode": "existing",
"sync": "incremental",
- "target": "drive1.inc0.qcow2"
+ "target": "target1"
}
},
]
@@ -1119,19 +1226,35 @@ described above. This example demonstrates the single-job failure case:
$ qemu-img create -f qcow2 drive0.inc0.qcow2 \
-b drive0.full.qcow2 -F qcow2
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
#. Attempt to create an incremental backup via QMP:
.. code-block:: QMP
-> {
- "execute": "drive-backup",
+ "execute": "blockdev-backup",
"arguments": {
"device": "drive0",
"bitmap": "bitmap0",
- "target": "drive0.inc0.qcow2",
- "format": "qcow2",
- "sync": "incremental",
- "mode": "existing"
+ "target": "target0",
+ "sync": "incremental"
}
}
@@ -1164,6 +1287,19 @@ described above. This example demonstrates the single-job failure case:
"event": "BLOCK_JOB_COMPLETED"
}
+#. Remove target node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-del",
+ "arguments": {
+ "node-name": "target0"
+ }
+ }
+
+ <- { "return": {} }
+
#. Delete the failed image, and re-create it.
.. code:: bash
@@ -1172,20 +1308,36 @@ described above. This example demonstrates the single-job failure case:
$ qemu-img create -f qcow2 drive0.inc0.qcow2 \
-b drive0.full.qcow2 -F qcow2
+#. Add target block node:
+
+ .. code-block:: QMP
+
+ -> {
+ "execute": "blockdev-add",
+ "arguments": {
+ "node-name": "target0",
+ "driver": "qcow2",
+ "file": {
+ "driver": "file",
+ "filename": "drive0.inc0.qcow2"
+ }
+ }
+ }
+
+ <- { "return": {} }
+
#. Retry the command after fixing the underlying problem, such as
freeing up space on the backup volume:
.. code-block:: QMP
-> {
- "execute": "drive-backup",
+ "execute": "blockdev-backup",
"arguments": {
"device": "drive0",
"bitmap": "bitmap0",
- "target": "drive0.inc0.qcow2",
- "format": "qcow2",
- "sync": "incremental",
- "mode": "existing"
+ "target": "target0",
+ "sync": "incremental"
}
}
@@ -1210,7 +1362,8 @@ described above. This example demonstrates the single-job failure case:
Example: Partial Transactional Failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-QMP commands like `drive-backup <qemu-qmp-ref.html#index-drive_002dbackup>`_
+QMP commands like `blockdev-backup
+<qemu-qmp-ref.html#index-blockdev_002dbackup>`_
conceptually only start a job, and so transactions containing these commands
may succeed even if the job it created later fails. This might have surprising
interactions with notions of how a "transaction" ought to behave.
@@ -1240,25 +1393,21 @@ and one succeeds:
"arguments": {
"actions": [
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive0",
"bitmap": "bitmap0",
- "format": "qcow2",
- "mode": "existing",
"sync": "incremental",
- "target": "drive0.inc0.qcow2"
+ "target": "target0"
}
},
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive1",
"bitmap": "bitmap0",
- "format": "qcow2",
- "mode": "existing",
"sync": "incremental",
- "target": "drive1.inc0.qcow2"
+ "target": "target1"
}
}]
}
@@ -1375,25 +1524,21 @@ applied:
},
"actions": [
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive0",
"bitmap": "bitmap0",
- "format": "qcow2",
- "mode": "existing",
"sync": "incremental",
- "target": "drive0.inc0.qcow2"
+ "target": "target0"
}
},
{
- "type": "drive-backup",
+ "type": "blockdev-backup",
"data": {
"device": "drive1",
"bitmap": "bitmap0",
- "format": "qcow2",
- "mode": "existing",
"sync": "incremental",
- "target": "drive1.inc0.qcow2"
+ "target": "target1"
}
}]
}
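The transaction payloads in the hunks above all share one shape, so they can be generated rather than written by hand. The following is a minimal Python sketch, not part of the patch: the helper names `backup_action` and `backup_transaction` are hypothetical, and it assumes the target nodes (`target0`, `target1`) have already been added to the block graph.

```python
import json


def backup_action(device, target, bitmap="bitmap0"):
    """Build one blockdev-backup transaction action (hypothetical helper).

    `target` is the node-name of an existing block node, not a filename,
    so no "format" or "mode" keys are emitted.
    """
    return {
        "type": "blockdev-backup",
        "data": {
            "device": device,
            "bitmap": bitmap,
            "sync": "incremental",
            "target": target,
        },
    }


def backup_transaction(pairs, bitmap="bitmap0"):
    """Wrap one action per (device, target node) pair in a QMP transaction."""
    return {
        "execute": "transaction",
        "arguments": {
            "actions": [backup_action(dev, tgt, bitmap) for dev, tgt in pairs]
        },
    }


cmd = backup_transaction([("drive0", "target0"), ("drive1", "target1")])
print(json.dumps(cmd, indent=2))
```

As the surrounding text notes, a successful reply to such a transaction only means the jobs were started; each job still has to be monitored individually.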
diff --git a/docs/interop/dbus-display.rst b/docs/interop/dbus-display.rst
new file mode 100644
index 0000000000..8c6e8e0f5a
--- /dev/null
+++ b/docs/interop/dbus-display.rst
@@ -0,0 +1,31 @@
+D-Bus display
+=============
+
+QEMU can export the VM display through D-Bus (when started with ``-display
+dbus``), to allow out-of-process UIs, remote protocol servers or other
+interactive display usages.
+
+Various specialized D-Bus interfaces are available on different object paths
+under ``/org/qemu/Display1/``, depending on the VM configuration.
+
+QEMU also implements the standard interfaces, such as
+`org.freedesktop.DBus.Introspectable
+<https://dbus.freedesktop.org/doc/dbus-specification.html#standard-interfaces>`_.
+
+.. contents::
+ :local:
+ :depth: 1
+
+.. only:: sphinx4
+
+ .. dbus-doc:: ui/dbus-display1.xml
+
+.. only:: not sphinx4
+
+ .. warning::
+ Sphinx 4 is required to build D-Bus documentation.
+
+ This is the content of ``ui/dbus-display1.xml``:
+
+ .. literalinclude:: ../../ui/dbus-display1.xml
+ :language: xml
diff --git a/docs/interop/dbus-vmstate.rst b/docs/interop/dbus-vmstate.rst
index 1d719c1c60..5fb3f279e2 100644
--- a/docs/interop/dbus-vmstate.rst
+++ b/docs/interop/dbus-vmstate.rst
@@ -2,9 +2,6 @@
D-Bus VMState
=============
-Introduction
-============
-
The QEMU dbus-vmstate object's aim is to migrate helpers' data running
on a QEMU D-Bus bus. (refer to the :doc:`dbus` document for
some recommendations on D-Bus usage)
@@ -26,49 +23,16 @@ dbus-vmstate object can be configured with the expected list of
helpers by setting its ``id-list`` property, with a comma-separated
``Id`` list.
-Interface
-=========
-
-On object path ``/org/qemu/VMState1``, the following
-``org.qemu.VMState1`` interface should be implemented:
-
-.. code:: xml
-
- <interface name="org.qemu.VMState1">
- <property name="Id" type="s" access="read"/>
- <method name="Load">
- <arg type="ay" name="data" direction="in"/>
- </method>
- <method name="Save">
- <arg type="ay" name="data" direction="out"/>
- </method>
- </interface>
-
-"Id" property
--------------
-
-A string that identifies the helper uniquely. (maximum 256 bytes
-including terminating NUL byte)
-
-.. note::
-
- The helper ID namespace is a separate namespace. In particular, it is not
- related to QEMU "id" used in -object/-device objects.
-
-Load(in u8[] bytes) method
---------------------------
-
-The method called on destination with the state to restore.
+.. only:: sphinx4
-The helper may be initially started in a waiting state (with
-an --incoming argument for example), and it may resume on success.
+ .. dbus-doc:: backends/dbus-vmstate1.xml
-An error may be returned to the caller.
+.. only:: not sphinx4
-Save(out u8[] bytes) method
----------------------------
+ .. warning::
+ Sphinx 4 is required to build D-Bus documentation.
-The method called on the source to get the current state to be
-migrated. The helper should continue to run normally.
+ This is the content of ``backends/dbus-vmstate1.xml``:
-An error may be returned to the caller.
+ .. literalinclude:: ../../backends/dbus-vmstate1.xml
+ :language: xml
diff --git a/docs/interop/dbus.rst b/docs/interop/dbus.rst
index be596d3f41..427debc9c5 100644
--- a/docs/interop/dbus.rst
+++ b/docs/interop/dbus.rst
@@ -108,3 +108,5 @@ QEMU Interfaces
===============
:doc:`dbus-vmstate`
+
+:doc:`dbus-display`
diff --git a/docs/interop/firmware.json b/docs/interop/firmware.json
index 8d8b0be030..54a1fc6c10 100644
--- a/docs/interop/firmware.json
+++ b/docs/interop/firmware.json
@@ -113,13 +113,22 @@
# Virtualization, as specified in the AMD64 Architecture
# Programmer's Manual. QEMU command line options related to
# this feature are documented in
-# "docs/amd-memory-encryption.txt".
+# "docs/system/i386/amd-memory-encryption.rst".
#
# @amd-sev-es: The firmware supports running under AMD Secure Encrypted
# Virtualization - Encrypted State, as specified in the AMD64
# Architecture Programmer's Manual. QEMU command line options
# related to this feature are documented in
-# "docs/amd-memory-encryption.txt".
+# "docs/system/i386/amd-memory-encryption.rst".
+#
+# @amd-sev-snp: The firmware supports running under AMD Secure Encrypted
+# Virtualization - Secure Nested Paging, as specified in the
+# AMD64 Architecture Programmer's Manual. QEMU command line
+# options related to this feature are documented in
+# "docs/system/i386/amd-memory-encryption.rst".
+#
+# @intel-tdx: The firmware supports running under Intel Trust Domain
+# Extensions (TDX).
#
# @enrolled-keys: The variable store (NVRAM) template associated with
# the firmware binary has the UEFI Secure Boot
@@ -185,9 +194,11 @@
# Since: 3.0
##
{ 'enum' : 'FirmwareFeature',
- 'data' : [ 'acpi-s3', 'acpi-s4', 'amd-sev', 'amd-sev-es', 'enrolled-keys',
- 'requires-smm', 'secure-boot', 'verbose-dynamic',
- 'verbose-static' ] }
+ 'data' : [ 'acpi-s3', 'acpi-s4',
+ 'amd-sev', 'amd-sev-es', 'amd-sev-snp',
+ 'intel-tdx',
+ 'enrolled-keys', 'requires-smm', 'secure-boot',
+ 'verbose-dynamic', 'verbose-static' ] }
##
# @FirmwareFlashFile:
@@ -210,24 +221,61 @@
'data' : { 'filename' : 'str',
'format' : 'BlockdevDriver' } }
+
+##
+# @FirmwareFlashMode:
+#
+# Describes how the firmware build handles code versus variable
+# persistence.
+#
+# @split: the executable file contains code while the NVRAM
+# template provides variable storage. The executable
+# must be configured read-only and can be shared between
+# multiple guests. The NVRAM template must be cloned
+# for each new guest and configured read-write.
+#
+# @combined: the executable file contains both code and
+# variable storage. The executable must be cloned
+# for each new guest and configured read-write.
+# No NVRAM template will be specified.
+#
+# @stateless: the executable file contains code and variable
+# storage is not persisted. The executable must
+# be configured read-only and can be shared
+# between multiple guests. No NVRAM template
+# will be specified.
+#
+# Since: 7.0.0
+##
+{ 'enum': 'FirmwareFlashMode',
+ 'data': [ 'split', 'combined', 'stateless' ] }
+
##
# @FirmwareMappingFlash:
#
# Describes loading and mapping properties for the firmware executable
# and its accompanying NVRAM file, when @FirmwareDevice is @flash.
#
-# @executable: Identifies the firmware executable. The firmware
-# executable may be shared by multiple virtual machine
-# definitions. The preferred corresponding QEMU command
-# line options are
+# @mode: Describes how the firmware build handles code versus variable
+# storage. If not present, it must be treated as if it was
+# configured with value @split. Since: 7.0.0
+#
+# @executable: Identifies the firmware executable. The @mode
+# indicates whether there will be an associated
+# NVRAM template present. The preferred
+# corresponding QEMU command line options are
# -drive if=none,id=pflash0,readonly=on,file=@executable.@filename,format=@executable.@format
# -machine pflash0=pflash0
-# or equivalent -blockdev instead of -drive.
+# or equivalent -blockdev instead of -drive. When
+# @mode is @combined the executable must be
+# cloned before use and configured with readonly=off.
# With QEMU versions older than 4.0, you have to use
# -drive if=pflash,unit=0,readonly=on,file=@executable.@filename,format=@executable.@format
#
# @nvram-template: Identifies the NVRAM template compatible with
-# @executable. Management software instantiates an
+# @executable, when @mode is set to @split,
+# otherwise it should not be present.
+# Management software instantiates an
# individual copy -- a specific NVRAM file -- from
# @nvram-template.@filename for each new virtual
# machine definition created. @nvram-template.@filename
@@ -246,8 +294,9 @@
# Since: 3.0
##
{ 'struct' : 'FirmwareMappingFlash',
- 'data' : { 'executable' : 'FirmwareFlashFile',
- 'nvram-template' : 'FirmwareFlashFile' } }
+ 'data' : { '*mode': 'FirmwareFlashMode',
+ 'executable' : 'FirmwareFlashFile',
+ '*nvram-template' : 'FirmwareFlashFile' } }
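To illustrate how management software might interpret the new optional `mode` member, here is a minimal Python sketch. It is one reading of the schema comments, not QEMU or libvirt code: the function name `flash_plan` and the keys of its result are hypothetical, and rejecting a `split` mapping without an `nvram-template` is an assumption drawn from the `@split` description.

```python
VALID_MODES = ("split", "combined", "stateless")


def flash_plan(mapping):
    """Derive handling rules from a FirmwareMappingFlash-style dict.

    Hypothetical helper: a missing "mode" is treated as "split", as the
    schema comment for @mode requires.
    """
    mode = mapping.get("mode", "split")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown flash mode: {mode!r}")

    has_template = "nvram-template" in mapping
    if mode == "split" and not has_template:
        # Assumption: split mode is only usable with an NVRAM template.
        raise ValueError("split mode expects an nvram-template")
    if mode != "split" and has_template:
        raise ValueError(f"{mode} mode should not carry an nvram-template")

    return {
        "mode": mode,
        # Only a combined image holds variables, so only it is writable.
        "executable_readonly": mode != "combined",
        "clone_executable": mode == "combined",
        "clone_nvram_template": mode == "split",
    }


legacy = {
    "device": "flash",
    "executable": {"filename": "/usr/share/OVMF/OVMF_CODE.fd", "format": "raw"},
    "nvram-template": {"filename": "/usr/share/OVMF/OVMF_VARS.fd", "format": "raw"},
}
print(flash_plan(legacy))
```

Descriptors written before the `mode` member existed, such as the OVMF examples in this file, fall through to the `split` branch unchanged.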
##
# @FirmwareMappingKernel:
@@ -386,203 +435,203 @@
#
# Examples:
#
-# {
-# "description": "SeaBIOS",
-# "interface-types": [
-# "bios"
-# ],
-# "mapping": {
-# "device": "memory",
-# "filename": "/usr/share/seabios/bios-256k.bin"
-# },
-# "targets": [
-# {
-# "architecture": "i386",
-# "machines": [
-# "pc-i440fx-*",
-# "pc-q35-*"
-# ]
+# {
+# "description": "SeaBIOS",
+# "interface-types": [
+# "bios"
+# ],
+# "mapping": {
+# "device": "memory",
+# "filename": "/usr/share/seabios/bios-256k.bin"
# },
-# {
-# "architecture": "x86_64",
-# "machines": [
-# "pc-i440fx-*",
-# "pc-q35-*"
-# ]
-# }
-# ],
-# "features": [
-# "acpi-s3",
-# "acpi-s4"
-# ],
-# "tags": [
-# "CONFIG_BOOTSPLASH=n",
-# "CONFIG_ROM_SIZE=256",
-# "CONFIG_USE_SMM=n"
-# ]
-# }
-#
-# {
-# "description": "OVMF with SB+SMM, empty varstore",
-# "interface-types": [
-# "uefi"
-# ],
-# "mapping": {
-# "device": "flash",
-# "executable": {
-# "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd",
-# "format": "raw"
+# "targets": [
+# {
+# "architecture": "i386",
+# "machines": [
+# "pc-i440fx-*",
+# "pc-q35-*"
+# ]
+# },
+# {
+# "architecture": "x86_64",
+# "machines": [
+# "pc-i440fx-*",
+# "pc-q35-*"
+# ]
+# }
+# ],
+# "features": [
+# "acpi-s3",
+# "acpi-s4"
+# ],
+# "tags": [
+# "CONFIG_BOOTSPLASH=n",
+# "CONFIG_ROM_SIZE=256",
+# "CONFIG_USE_SMM=n"
+# ]
+# }
+#
+# {
+# "description": "OVMF with SB+SMM, empty varstore",
+# "interface-types": [
+# "uefi"
+# ],
+# "mapping": {
+# "device": "flash",
+# "executable": {
+# "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd",
+# "format": "raw"
+# },
+# "nvram-template": {
+# "filename": "/usr/share/OVMF/OVMF_VARS.fd",
+# "format": "raw"
+# }
# },
-# "nvram-template": {
-# "filename": "/usr/share/OVMF/OVMF_VARS.fd",
-# "format": "raw"
-# }
-# },
-# "targets": [
-# {
-# "architecture": "x86_64",
-# "machines": [
-# "pc-q35-*"
-# ]
-# }
-# ],
-# "features": [
-# "acpi-s3",
-# "amd-sev",
-# "requires-smm",
-# "secure-boot",
-# "verbose-dynamic"
-# ],
-# "tags": [
-# "-a IA32",
-# "-a X64",
-# "-p OvmfPkg/OvmfPkgIa32X64.dsc",
-# "-t GCC48",
-# "-b DEBUG",
-# "-D SMM_REQUIRE",
-# "-D SECURE_BOOT_ENABLE",
-# "-D FD_SIZE_4MB"
-# ]
-# }
-#
-# {
-# "description": "OVMF with SB+SMM, SB enabled, MS certs enrolled",
-# "interface-types": [
-# "uefi"
-# ],
-# "mapping": {
-# "device": "flash",
-# "executable": {
-# "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd",
-# "format": "raw"
+# "targets": [
+# {
+# "architecture": "x86_64",
+# "machines": [
+# "pc-q35-*"
+# ]
+# }
+# ],
+# "features": [
+# "acpi-s3",
+# "amd-sev",
+# "requires-smm",
+# "secure-boot",
+# "verbose-dynamic"
+# ],
+# "tags": [
+# "-a IA32",
+# "-a X64",
+# "-p OvmfPkg/OvmfPkgIa32X64.dsc",
+# "-t GCC48",
+# "-b DEBUG",
+# "-D SMM_REQUIRE",
+# "-D SECURE_BOOT_ENABLE",
+# "-D FD_SIZE_4MB"
+# ]
+# }
+#
+# {
+# "description": "OVMF with SB+SMM, SB enabled, MS certs enrolled",
+# "interface-types": [
+# "uefi"
+# ],
+# "mapping": {
+# "device": "flash",
+# "executable": {
+# "filename": "/usr/share/OVMF/OVMF_CODE.secboot.fd",
+# "format": "raw"
+# },
+# "nvram-template": {
+# "filename": "/usr/share/OVMF/OVMF_VARS.secboot.fd",
+# "format": "raw"
+# }
# },
-# "nvram-template": {
-# "filename": "/usr/share/OVMF/OVMF_VARS.secboot.fd",
-# "format": "raw"
-# }
-# },
-# "targets": [
-# {
-# "architecture": "x86_64",
-# "machines": [
-# "pc-q35-*"
-# ]
-# }
-# ],
-# "features": [
-# "acpi-s3",
-# "amd-sev",
-# "enrolled-keys",
-# "requires-smm",
-# "secure-boot",
-# "verbose-dynamic"
-# ],
-# "tags": [
-# "-a IA32",
-# "-a X64",
-# "-p OvmfPkg/OvmfPkgIa32X64.dsc",
-# "-t GCC48",
-# "-b DEBUG",
-# "-D SMM_REQUIRE",
-# "-D SECURE_BOOT_ENABLE",
-# "-D FD_SIZE_4MB"
-# ]
-# }
-#
-# {
-# "description": "OVMF with SEV-ES support",
-# "interface-types": [
-# "uefi"
-# ],
-# "mapping": {
-# "device": "flash",
-# "executable": {
-# "filename": "/usr/share/OVMF/OVMF_CODE.fd",
-# "format": "raw"
+# "targets": [
+# {
+# "architecture": "x86_64",
+# "machines": [
+# "pc-q35-*"
+# ]
+# }
+# ],
+# "features": [
+# "acpi-s3",
+# "amd-sev",
+# "enrolled-keys",
+# "requires-smm",
+# "secure-boot",
+# "verbose-dynamic"
+# ],
+# "tags": [
+# "-a IA32",
+# "-a X64",
+# "-p OvmfPkg/OvmfPkgIa32X64.dsc",
+# "-t GCC48",
+# "-b DEBUG",
+# "-D SMM_REQUIRE",
+# "-D SECURE_BOOT_ENABLE",
+# "-D FD_SIZE_4MB"
+# ]
+# }
+#
+# {
+# "description": "OVMF with SEV-ES support",
+# "interface-types": [
+# "uefi"
+# ],
+# "mapping": {
+# "device": "flash",
+# "executable": {
+# "filename": "/usr/share/OVMF/OVMF_CODE.fd",
+# "format": "raw"
+# },
+# "nvram-template": {
+# "filename": "/usr/share/OVMF/OVMF_VARS.fd",
+# "format": "raw"
+# }
# },
-# "nvram-template": {
-# "filename": "/usr/share/OVMF/OVMF_VARS.fd",
-# "format": "raw"
-# }
-# },
-# "targets": [
-# {
-# "architecture": "x86_64",
-# "machines": [
-# "pc-q35-*"
-# ]
-# }
-# ],
-# "features": [
-# "acpi-s3",
-# "amd-sev",
-# "amd-sev-es",
-# "verbose-dynamic"
-# ],
-# "tags": [
-# "-a X64",
-# "-p OvmfPkg/OvmfPkgX64.dsc",
-# "-t GCC48",
-# "-b DEBUG",
-# "-D FD_SIZE_4MB"
-# ]
-# }
-#
-# {
-# "description": "UEFI firmware for ARM64 virtual machines",
-# "interface-types": [
-# "uefi"
-# ],
-# "mapping": {
-# "device": "flash",
-# "executable": {
-# "filename": "/usr/share/AAVMF/AAVMF_CODE.fd",
-# "format": "raw"
+# "targets": [
+# {
+# "architecture": "x86_64",
+# "machines": [
+# "pc-q35-*"
+# ]
+# }
+# ],
+# "features": [
+# "acpi-s3",
+# "amd-sev",
+# "amd-sev-es",
+# "verbose-dynamic"
+# ],
+# "tags": [
+# "-a X64",
+# "-p OvmfPkg/OvmfPkgX64.dsc",
+# "-t GCC48",
+# "-b DEBUG",
+# "-D FD_SIZE_4MB"
+# ]
+# }
+#
+# {
+# "description": "UEFI firmware for ARM64 virtual machines",
+# "interface-types": [
+# "uefi"
+# ],
+# "mapping": {
+# "device": "flash",
+# "executable": {
+# "filename": "/usr/share/AAVMF/AAVMF_CODE.fd",
+# "format": "raw"
+# },
+# "nvram-template": {
+# "filename": "/usr/share/AAVMF/AAVMF_VARS.fd",
+# "format": "raw"
+# }
# },
-# "nvram-template": {
-# "filename": "/usr/share/AAVMF/AAVMF_VARS.fd",
-# "format": "raw"
-# }
-# },
-# "targets": [
-# {
-# "architecture": "aarch64",
-# "machines": [
-# "virt-*"
-# ]
-# }
-# ],
-# "features": [
-#
-# ],
-# "tags": [
-# "-a AARCH64",
-# "-p ArmVirtPkg/ArmVirtQemu.dsc",
-# "-t GCC48",
-# "-b DEBUG",
-# "-D DEBUG_PRINT_ERROR_LEVEL=0x80000000"
-# ]
-# }
+# "targets": [
+# {
+# "architecture": "aarch64",
+# "machines": [
+# "virt-*"
+# ]
+# }
+# ],
+# "features": [
+#
+# ],
+# "tags": [
+# "-a AARCH64",
+# "-p ArmVirtPkg/ArmVirtQemu.dsc",
+# "-t GCC48",
+# "-b DEBUG",
+# "-D DEBUG_PRINT_ERROR_LEVEL=0x80000000"
+# ]
+# }
##
{ 'struct' : 'Firmware',
'data' : { 'description' : 'str',
diff --git a/docs/interop/index.rst b/docs/interop/index.rst
index 47b9ed82bb..ed65395bfb 100644
--- a/docs/interop/index.rst
+++ b/docs/interop/index.rst
@@ -12,8 +12,10 @@ are useful for making QEMU interoperate with other software.
bitmaps
dbus
dbus-vmstate
+ dbus-display
live-block-operations
pr-helper
+ qmp-spec
qemu-ga
qemu-ga-ref
qemu-qmp-ref
@@ -21,3 +23,5 @@ are useful for making QEMU interoperate with other software.
vhost-user
vhost-user-gpu
vhost-vdpa
+ virtio-balloon-stats
+ vnc-ledstate-pseudo-encoding
diff --git a/docs/interop/live-block-operations.rst b/docs/interop/live-block-operations.rst
index 9e3635b233..691429c7af 100644
--- a/docs/interop/live-block-operations.rst
+++ b/docs/interop/live-block-operations.rst
@@ -4,6 +4,8 @@
This work is licensed under the terms of the GNU GPL, version 2 or
later. See the COPYING file in the top-level directory.
+.. _Live Block Operations:
+
============================
Live Block Device Operations
============================
@@ -53,7 +55,7 @@ files in a disk image backing chain:
(1) Directional: 'base' and 'top'. Given the simple disk image chain
above, image [A] can be referred to as 'base', and image [B] as
- 'top'. (This terminology can be seen in in QAPI schema file,
+ 'top'. (This terminology can be seen in the QAPI schema file,
block-core.json.)
(2) Relational: 'backing file' and 'overlay'. Again, taking the same
@@ -116,8 +118,8 @@ QEMU block layer supports.
(3) ``drive-mirror`` (and ``blockdev-mirror``): Synchronize a running
disk to another image.
-(4) ``drive-backup`` (and ``blockdev-backup``): Point-in-time (live) copy
- of a block device to a destination.
+(4) ``blockdev-backup`` (and the deprecated ``drive-backup``):
+ Point-in-time (live) copy of a block device to a destination.
.. _`Interacting with a QEMU instance`:
@@ -555,13 +557,14 @@ Currently, there are four different kinds:
(3) ``none`` -- Synchronize only the new writes from this point on.
- .. note:: In the case of ``drive-backup`` (or ``blockdev-backup``),
- the behavior of ``none`` synchronization mode is different.
- Normally, a ``backup`` job consists of two parts: Anything
- that is overwritten by the guest is first copied out to
- the backup, and in the background the whole image is
- copied from start to end. With ``sync=none``, it's only
- the first part.
+ .. note:: In the case of ``blockdev-backup`` (or deprecated
+ ``drive-backup``), the behavior of ``none``
+ synchronization mode is different. Normally, a
+ ``backup`` job consists of two parts: Anything that is
+ overwritten by the guest is first copied out to the
+ backup, and in the background the whole image is copied
+ from start to end. With ``sync=none``, it's only the
+ first part.
(4) ``incremental`` -- Synchronize content that is described by the
dirty bitmap
@@ -640,7 +643,7 @@ at this point:
(QEMU) block-job-complete device=job0
In either of the above cases, if you once again run the
-`query-block-jobs` command, there should not be any active block
+``query-block-jobs`` command, there should not be any active block
operation.
Comparing 'commit' and 'mirror': In both cases, the overlay images
@@ -824,7 +827,7 @@ entire disk image chain, to a target, using ``blockdev-mirror`` would be:
job ready to be completed
(5) Gracefully complete the 'mirror' block device job, and notice the
- the event ``BLOCK_JOB_COMPLETED``
+ event ``BLOCK_JOB_COMPLETED``
(6) Shutdown the guest by issuing the QMP ``quit`` command so that
caches are flushed
@@ -928,19 +931,22 @@ Shutdown the guest, by issuing the ``quit`` QMP command::
}
-Live disk backup --- ``drive-backup`` and ``blockdev-backup``
--------------------------------------------------------------
+Live disk backup --- ``blockdev-backup`` and the deprecated ``drive-backup``
+----------------------------------------------------------------------------
-The ``drive-backup`` (and its newer equivalent ``blockdev-backup``) allows
+The ``blockdev-backup`` (and the deprecated ``drive-backup``) allows
you to create a point-in-time snapshot.
-In this case, the point-in-time is when you *start* the ``drive-backup``
-(or its newer equivalent ``blockdev-backup``) command.
+In this case, the point-in-time is when you *start* the
+``blockdev-backup`` (or deprecated ``drive-backup``) command.
QMP invocation for ``drive-backup``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Note that the ``drive-backup`` command has been deprecated since QEMU 6.2 and
+will be removed in a future release.
+
Yet again, starting afresh with our example disk image chain::
[A] <-- [B] <-- [C] <-- [D]
@@ -965,11 +971,22 @@ will be issued, indicating the live block device job operation has
completed, and no further action is required.
+Moving from the deprecated ``drive-backup`` to newer ``blockdev-backup``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+``blockdev-backup`` differs from ``drive-backup`` in how you specify
+the backup target. With ``blockdev-backup`` you cannot specify a filename
+as the target. Instead, you use the ``node-name`` of an existing block node,
+which you may add with the ``blockdev-add`` or ``blockdev-create`` commands.
+Correspondingly, ``blockdev-backup`` does not have the ``mode`` and
+``format`` arguments, which do not apply to an existing block node. See
+the following sections for details and examples.
+
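The migration described above amounts to two mechanical steps: add a node for the target image, then issue ``blockdev-backup`` against that node. A minimal Python sketch of the conversion follows. The helper names are hypothetical, the ``blockdev-add`` shape mirrors the node definition used elsewhere in this patch, and it assumes the target image file already exists and that the old ``format`` value maps directly to the node's ``driver``.

```python
import json


def target_node_command(node_name, filename, driver="qcow2"):
    """Build a blockdev-add command for the backup target (hypothetical helper)."""
    return {
        "execute": "blockdev-add",
        "arguments": {
            "node-name": node_name,
            "driver": driver,
            "file": {"driver": "file", "filename": filename},
        },
    }


def to_blockdev_backup(drive_backup_args, node_name):
    """Translate drive-backup arguments into a blockdev-backup command.

    "format" and "mode" are dropped because they do not apply to an
    existing node; "target" becomes the node-name instead of a filename.
    """
    args = {k: v for k, v in drive_backup_args.items()
            if k not in ("format", "mode")}
    args["target"] = node_name
    return {"execute": "blockdev-backup", "arguments": args}


old_args = {
    "device": "drive0",
    "bitmap": "bitmap0",
    "target": "drive0.inc0.qcow2",
    "format": "qcow2",
    "sync": "incremental",
    "mode": "existing",
}
for cmd in (
    target_node_command("target0", old_args["target"], old_args["format"]),
    to_blockdev_backup(old_args, "target0"),
):
    print(json.dumps(cmd))
```

The first command must succeed before the second is sent, since ``blockdev-backup`` refers to the node by name.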
+
Notes on ``blockdev-backup``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-The ``blockdev-backup`` command is equivalent in functionality to
-``drive-backup``, except that it operates at node-level in a Block Driver
+The ``blockdev-backup`` command operates at node-level in a Block Driver
State (BDS) graph.
E.g. the sequence of actions to create a point-in-time backup
diff --git a/docs/interop/nbd.txt b/docs/interop/nbd.txt
index 10ce098a29..18efb251de 100644
--- a/docs/interop/nbd.txt
+++ b/docs/interop/nbd.txt
@@ -1,4 +1,4 @@
-Qemu supports the NBD protocol, and has an internal NBD client (see
+QEMU supports the NBD protocol, and has an internal NBD client (see
block/nbd.c), an internal NBD server (see blockdev-nbd.c), and an
external NBD server tool (see qemu-nbd.c). The common code is placed
in nbd/*.
@@ -7,11 +7,11 @@ The NBD protocol is specified here:
https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md
The following paragraphs describe some specific properties of NBD
-protocol realization in Qemu.
+protocol realization in QEMU.
= Metadata namespaces =
-Qemu supports the "base:allocation" metadata context as defined in the
+QEMU supports the "base:allocation" metadata context as defined in the
NBD protocol specification, and also defines an additional metadata
namespace "qemu".
@@ -68,3 +68,5 @@ NBD_CMD_BLOCK_STATUS for "qemu:dirty-bitmap:", NBD_CMD_CACHE
* 4.2: NBD_FLAG_CAN_MULTI_CONN for shareable read-only exports,
NBD_CMD_FLAG_FAST_ZERO
* 5.2: NBD_CMD_BLOCK_STATUS for "qemu:allocation-depth"
+* 7.1: NBD_FLAG_CAN_MULTI_CONN for shareable writable exports
+* 8.2: NBD_OPT_EXTENDED_HEADERS, NBD_FLAG_BLOCK_STATUS_PAYLOAD
diff --git a/docs/interop/prl-xml.txt b/docs/interop/prl-xml.txt
index 7031f8752c..cf9b3fba26 100644
--- a/docs/interop/prl-xml.txt
+++ b/docs/interop/prl-xml.txt
@@ -122,7 +122,7 @@ Each Image element has following child elements:
* Type - image type of the element. It can be:
"Plain" for raw files.
"Compressed" for expanding disks.
- * File - path to image file. Path can be relative to DiskDecriptor.xml or
+ * File - path to image file. Path can be relative to DiskDescriptor.xml or
absolute.
== Snapshots element ==
diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index 0463f761ef..2c4618375a 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -214,14 +214,19 @@ version 2.
type.
If the incompatible bit "Compression type" is set: the field
- must be present and non-zero (which means non-zlib
+ must be present and non-zero (which means non-deflate
compression type). Otherwise, this field must not be present
- or must be zero (which means zlib).
+ or must be zero (which means deflate).
Available compression type values:
- 0: zlib <https://www.zlib.net/>
+ 0: deflate <https://www.ietf.org/rfc/rfc1951.txt>
1: zstd <http://github.com/facebook/zstd>
+ The deflate compression type is called "zlib"
+ <https://www.zlib.net/> in QEMU. However, clusters with the
+ deflate compression type do not have zlib headers.
+
+ 105 - 111: Padding, contents defined below.
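The distinction between "deflate" and "zlib" matters when reading such data with a general-purpose library. The Python sketch below shows raw deflate (RFC 1951) in general; it is not a qcow2 cluster reader. The helper names are hypothetical and the window size of 15 bits is an assumption chosen for the illustration, not a value taken from the specification.

```python
import zlib


def deflate_raw(data, level=6):
    """Compress to a raw deflate stream (negative wbits: no zlib header/trailer)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def inflate_raw(blob):
    """Decompress a raw deflate stream; negative wbits disables header parsing."""
    return zlib.decompress(blob, -15)


payload = b"qcow2 cluster payload " * 64
raw = deflate_raw(payload)
assert inflate_raw(raw) == payload

# A decoder that expects the zlib wrapper rejects the headerless stream.
try:
    zlib.decompress(raw)
except zlib.error as exc:
    print("zlib-wrapped decode failed as expected:", exc)
```

A tool that feeds such data to a decoder configured for the zlib container format will fail in the same way, which is why the specification now names the type "deflate".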
=== Header padding ===
@@ -313,7 +318,7 @@ The fields of the bitmaps extension are:
The number of bitmaps contained in the image. Must be
greater than or equal to 1.
- Note: Qemu currently only supports up to 65535 bitmaps per
+ Note: QEMU currently only supports up to 65535 bitmaps per
image.
4 - 7: Reserved, must be zero.
@@ -775,7 +780,7 @@ Structure of a bitmap directory entry:
2: extra_data_compatible
This flag is meaningful when the extra data is
unknown to the software (currently any extra data is
- unknown to Qemu).
+ unknown to QEMU).
If it is set, the bitmap may be used as expected, extra
data must be left as is.
If it is not set, the bitmap must not be used, but
@@ -793,7 +798,7 @@ Structure of a bitmap directory entry:
17: granularity_bits
Granularity bits. Valid values: 0 - 63.
- Note: Qemu currently supports only values 9 - 31.
+ Note: QEMU currently supports only values 9 - 31.
Granularity is calculated as
granularity = 1 << granularity_bits
@@ -804,7 +809,7 @@ Structure of a bitmap directory entry:
18 - 19: name_size
Size of the bitmap name. Must be non-zero.
- Note: Qemu currently doesn't support values greater than
+ Note: QEMU currently doesn't support values greater than
1023.
20 - 23: extra_data_size
diff --git a/docs/interop/qemu-ga.rst b/docs/interop/qemu-ga.rst
index 3063357bb5..72fb75a6f5 100644
--- a/docs/interop/qemu-ga.rst
+++ b/docs/interop/qemu-ga.rst
@@ -79,10 +79,15 @@ Options
Daemonize after startup (detach from terminal).
-.. option:: -b, --blacklist=LIST
+.. option:: -b, --block-rpcs=LIST
- Comma-separated list of RPCs to disable (no spaces, ``?`` to list
- available RPCs).
+ Comma-separated list of RPCs to disable (no spaces, use ``--block-rpcs=help``
+ to list available RPCs).
+
+.. option:: -a, --allow-rpcs=LIST
+
+ Comma-separated list of RPCs to enable (no spaces, use ``--allow-rpcs=help``
+ to list available RPCs).
.. option:: -D, --dump-conf
@@ -125,7 +130,7 @@ pidfile string
fsfreeze-hook string
statedir string
verbose boolean
-blacklist string list
+block-rpcs string list
============= ===========
See also
diff --git a/docs/interop/qemu-qmp-ref.rst b/docs/interop/qemu-qmp-ref.rst
index 357effd64f..f94614a0b2 100644
--- a/docs/interop/qemu-qmp-ref.rst
+++ b/docs/interop/qemu-qmp-ref.rst
@@ -1,3 +1,5 @@
+.. _QMP Ref:
+
QEMU QMP Reference Manual
=========================
diff --git a/docs/interop/qmp-intro.txt b/docs/interop/qmp-intro.txt
deleted file mode 100644
index 1c745a7af0..0000000000
--- a/docs/interop/qmp-intro.txt
+++ /dev/null
@@ -1,88 +0,0 @@
- QEMU Machine Protocol
- =====================
-
-Introduction
-------------
-
-The QEMU Machine Protocol (QMP) allows applications to operate a
-QEMU instance.
-
-QMP is JSON[1] based and features the following:
-
-- Lightweight, text-based, easy to parse data format
-- Asynchronous messages support (ie. events)
-- Capabilities Negotiation
-
-For detailed information on QMP's usage, please, refer to the following files:
-
-o qmp-spec.txt QEMU Machine Protocol current specification
-o qemu-qmp-ref.html QEMU QMP commands and events (auto-generated at build-time)
-
-[1] https://www.json.org
-
-Usage
------
-
-You can use the -qmp option to enable QMP. For example, the following
-makes QMP available on localhost port 4444:
-
-$ qemu [...] -qmp tcp:localhost:4444,server=on,wait=off
-
-However, for more flexibility and to make use of more options, the -mon
-command-line option should be used. For instance, the following example
-creates one HMP instance (human monitor) on stdio and one QMP instance
-on localhost port 4444:
-
-$ qemu [...] -chardev stdio,id=mon0 -mon chardev=mon0,mode=readline \
- -chardev socket,id=mon1,host=localhost,port=4444,server=on,wait=off \
- -mon chardev=mon1,mode=control,pretty=on
-
-Please, refer to QEMU's manpage for more information.
-
-Simple Testing
---------------
-
-To manually test QMP one can connect with telnet and issue commands by hand:
-
-$ telnet localhost 4444
-Trying 127.0.0.1...
-Connected to localhost.
-Escape character is '^]'.
-{
- "QMP": {
- "version": {
- "qemu": {
- "micro": 0,
- "minor": 0,
- "major": 3
- },
- "package": "v3.0.0"
- },
- "capabilities": [
- "oob"
- ]
- }
-}
-
-{ "execute": "qmp_capabilities" }
-{
- "return": {
- }
-}
-
-{ "execute": "query-status" }
-{
- "return": {
- "status": "prelaunch",
- "singlestep": false,
- "running": false
- }
-}
-
-Please refer to docs/interop/qemu-qmp-ref.* for a complete command
-reference, generated from qapi/qapi-schema.json.
-
-QMP wiki page
--------------
-
-https://wiki.qemu.org/QMP
diff --git a/docs/interop/qmp-spec.txt b/docs/interop/qmp-spec.rst
index b0e8351d5b..563344160e 100644
--- a/docs/interop/qmp-spec.txt
+++ b/docs/interop/qmp-spec.rst
@@ -1,24 +1,26 @@
- QEMU Machine Protocol Specification
+..
+ Copyright (C) 2009-2016 Red Hat, Inc.
-0. About This Document
-======================
-
-Copyright (C) 2009-2016 Red Hat, Inc.
+ This work is licensed under the terms of the GNU GPL, version 2 or
+ later. See the COPYING file in the top-level directory.
-This work is licensed under the terms of the GNU GPL, version 2 or
-later. See the COPYING file in the top-level directory.
-1. Introduction
-===============
+===================================
+QEMU Machine Protocol Specification
+===================================
-This document specifies the QEMU Machine Protocol (QMP), a JSON-based
+The QEMU Machine Protocol (QMP) is a JSON-based
protocol which is available for applications to operate QEMU at the
machine-level. It is also in use by the QEMU Guest Agent (QGA), which
is available for host applications to interact with the guest
-operating system.
+operating system. This page specifies the general format of
+the protocol; details of the commands and data structures can
+be found in the :doc:`qemu-qmp-ref` and the :doc:`qemu-ga-ref`.
-2. Protocol Specification
-=========================
+.. contents::
+
+Protocol Specification
+======================
This section details the protocol format. For the purpose of this
document, "Server" is either QEMU or the QEMU Guest Agent, and
@@ -30,9 +32,7 @@ following format:
json-DATA-STRUCTURE-NAME
Where DATA-STRUCTURE-NAME is any valid JSON data structure, as defined
-by the JSON standard:
-
-http://www.ietf.org/rfc/rfc8259.txt
+by the `JSON standard <http://www.ietf.org/rfc/rfc8259.txt>`_.
The server expects its input to be encoded in UTF-8, and sends its
output encoded in ASCII.
@@ -45,83 +45,89 @@ important unless specifically documented otherwise. Repeating a key
within a json-object gives unpredictable results.
Also for convenience, the server will accept an extension of
-'single-quoted' strings in place of the usual "double-quoted"
+``'single-quoted'`` strings in place of the usual ``"double-quoted"``
json-string, and both input forms of strings understand an additional
-escape sequence of "\'" for a single quote. The server will only use
+escape sequence of ``\'`` for a single quote. The server will only use
double quoting on output.
-2.1 General Definitions
------------------------
+General Definitions
+-------------------
-2.1.1 All interactions transmitted by the Server are json-objects, always
- terminating with CRLF
+All interactions transmitted by the Server are json-objects, always
+terminating with CRLF.
-2.1.2 All json-objects members are mandatory when not specified otherwise
+All json-objects members are mandatory when not specified otherwise.
-2.2 Server Greeting
--------------------
+Server Greeting
+---------------
Right when connected the Server will issue a greeting message, which signals
that the connection has been successfully established and that the Server is
ready for capabilities negotiation (for more information refer to section
-'4. Capabilities Negotiation').
+`Capabilities Negotiation`_).
The greeting message format is:
-{ "QMP": { "version": json-object, "capabilities": json-array } }
+::
+
+ { "QMP": { "version": json-object, "capabilities": json-array } }
- Where,
+Where:
-- The "version" member contains the Server's version information (the format
- is the same of the query-version command)
-- The "capabilities" member specify the availability of features beyond the
+- The ``version`` member contains the Server's version information (the format
+ is the same as for the query-version command).
+- The ``capabilities`` member specifies the availability of features beyond the
baseline specification; the order of elements in this array has no
particular significance.
-2.2.1 Capabilities
-------------------
+Capabilities
+------------
Currently supported capabilities are:
-- "oob": the QMP server supports "out-of-band" (OOB) command
- execution, as described in section "2.3.1 Out-of-band execution".
+``oob``
+ the QMP server supports "out-of-band" (OOB) command
+ execution, as described in section `Out-of-band execution`_.
-2.3 Issuing Commands
---------------------
+Issuing Commands
+----------------
The format for command execution is:
-{ "execute": json-string, "arguments": json-object, "id": json-value }
+::
+
+ { "execute": json-string, "arguments": json-object, "id": json-value }
or
-{ "exec-oob": json-string, "arguments": json-object, "id": json-value }
+::
- Where,
+ { "exec-oob": json-string, "arguments": json-object, "id": json-value }
-- The "execute" or "exec-oob" member identifies the command to be
+Where:
+
+- The ``execute`` or ``exec-oob`` member identifies the command to be
executed by the server. The latter requests out-of-band execution.
-- The "arguments" member is used to pass any arguments required for the
+- The ``arguments`` member is used to pass any arguments required for the
execution of the command, it is optional when no arguments are
required. Each command documents what contents will be considered
- valid when handling the json-argument
-- The "id" member is a transaction identification associated with the
+ valid when handling the json-argument.
+- The ``id`` member is a transaction identification associated with the
command execution, it is optional and will be part of the response
- if provided. The "id" member can be any json-value. A json-number
+ if provided. The ``id`` member can be any json-value. A json-number
incremented for each successive command works fine.
-The actual commands are documented in the QEMU QMP reference manual
-docs/interop/qemu-qmp-ref.{7,html,info,pdf,txt}.
+The actual commands are documented in the :doc:`qemu-qmp-ref`.
-2.3.1 Out-of-band execution
----------------------------
+Out-of-band execution
+---------------------
The server normally reads, executes and responds to one command after
the other. The client therefore receives command responses in issue
order.
-With out-of-band execution enabled via capability negotiation (section
-4.), the server reads and queues commands as they arrive. It executes
+With out-of-band execution enabled via `capabilities negotiation`_,
+the server reads and queues commands as they arrive. It executes
commands from the queue one after the other. Commands executed
out-of-band jump the queue: the command gets executed right away,
possibly overtaking prior in-band commands. The client may therefore
@@ -129,8 +135,8 @@ receive such a command's response before responses from prior in-band
commands.
To be able to match responses back to their commands, the client needs
-to pass "id" with out-of-band commands. Passing it with all commands
-is recommended for clients that accept capability "oob".
+to pass ``id`` with out-of-band commands. Passing it with all commands
+is recommended for clients that accept capability ``oob``.
If the client sends in-band commands faster than the server can
execute them, the server will stop reading requests until the request
@@ -140,57 +146,61 @@ To ensure commands to be executed out-of-band get read and executed,
the client should have at most eight in-band commands in flight.
Only a few commands support out-of-band execution. The ones that do
-have "allow-oob": true in output of query-qmp-schema.
+have ``"allow-oob": true`` in the output of ``query-qmp-schema``.
-2.4 Commands Responses
-----------------------
+Commands Responses
+------------------
There are two possible responses which the Server will issue as the result
of a command execution: success or error.
-As long as the commands were issued with a proper "id" field, then the
-same "id" field will be attached in the corresponding response message
+As long as the commands were issued with a proper ``id`` field, then the
+same ``id`` field will be attached in the corresponding response message
so that requests and responses can match. Clients should drop all the
-responses that have an unknown "id" field.
+responses that have an unknown ``id`` field.
-2.4.1 success
--------------
+Success
+-------
The format of a success response is:
-{ "return": json-value, "id": json-value }
+::
+
+ { "return": json-value, "id": json-value }
- Where,
+Where:
-- The "return" member contains the data returned by the command, which
+- The ``return`` member contains the data returned by the command, which
is defined on a per-command basis (usually a json-object or
json-array of json-objects, but sometimes a json-number, json-string,
or json-array of json-strings); it is an empty json-object if the
- command does not return data
-- The "id" member contains the transaction identification associated
- with the command execution if issued by the Client
+ command does not return data.
+- The ``id`` member contains the transaction identification associated
+ with the command execution if issued by the Client.
-2.4.2 error
------------
+Error
+-----
The format of an error response is:
-{ "error": { "class": json-string, "desc": json-string }, "id": json-value }
+::
- Where,
+ { "error": { "class": json-string, "desc": json-string }, "id": json-value }
-- The "class" member contains the error class name (eg. "GenericError")
-- The "desc" member is a human-readable error message. Clients should
+Where:
+
+- The ``class`` member contains the error class name (e.g. ``"GenericError"``).
+- The ``desc`` member is a human-readable error message. Clients should
not attempt to parse this message.
-- The "id" member contains the transaction identification associated with
- the command execution if issued by the Client
+- The ``id`` member contains the transaction identification associated with
+ the command execution if issued by the Client.
-NOTE: Some errors can occur before the Server is able to read the "id" member,
-in these cases the "id" member will not be part of the error response, even
+NOTE: Some errors can occur before the Server is able to read the ``id`` member;
+in these cases the ``id`` member will not be part of the error response, even
if provided by the client.
-2.5 Asynchronous events
------------------------
+Asynchronous events
+-------------------
As a result of state changes, the Server may send messages unilaterally
to the Client at any time, when not in the middle of any other
@@ -198,44 +208,45 @@ response. They are called "asynchronous events".
The format of asynchronous events is:
-{ "event": json-string, "data": json-object,
- "timestamp": { "seconds": json-number, "microseconds": json-number } }
+::
- Where,
+ { "event": json-string, "data": json-object,
+ "timestamp": { "seconds": json-number, "microseconds": json-number } }
-- The "event" member contains the event's name
-- The "data" member contains event specific data, which is defined in a
- per-event basis, it is optional
-- The "timestamp" member contains the exact time of when the event
+Where:
+
+- The ``event`` member contains the event's name.
+- The ``data`` member contains event-specific data, which is defined on a
+ per-event basis. It is optional.
+- The ``timestamp`` member contains the exact time of when the event
occurred in the Server. It is a fixed json-object with time in
seconds and microseconds relative to the Unix Epoch (1 Jan 1970); if
there is a failure to retrieve host time, both members of the
timestamp will be set to -1.
-The actual asynchronous events are documented in the QEMU QMP
-reference manual docs/interop/qemu-qmp-ref.{7,html,info,pdf,txt}.
+The actual asynchronous events are documented in the :doc:`qemu-qmp-ref`.
Some events are rate-limited to at most one per second. If additional
"similar" events arrive within one second, all but the last one are
dropped, and the last one is delayed. "Similar" normally means same
event type.
-2.6 Forcing the JSON parser into known-good state
--------------------------------------------------
+Forcing the JSON parser into known-good state
+---------------------------------------------
Incomplete or invalid input can leave the server's JSON parser in a
state where it can't parse additional commands. To get it back into
known-good state, the client should provoke a lexical error.
The cleanest way to do that is sending an ASCII control character
-other than '\t' (horizontal tab), '\r' (carriage return), or '\n' (new
-line).
+other than ``\t`` (horizontal tab), ``\r`` (carriage return), or
+``\n`` (new line).
Sadly, older versions of QEMU can fail to flag this as an error. If a
client needs to deal with them, it should send a 0xFF byte.
-2.7 QGA Synchronization
------------------------
+QGA Synchronization
+-------------------
When a client connects to QGA over a transport lacking proper
connection semantics such as virtio-serial, QGA may have read partial
@@ -243,86 +254,106 @@ input from a previous client. The client needs to force QGA's parser
into known-good state using the previous section's technique.
Moreover, the client may receive output a previous client didn't read.
To help with skipping that output, QGA provides the
-'guest-sync-delimited' command. Refer to its documentation for
+``guest-sync-delimited`` command. Refer to its documentation for
details.
-3. QMP Examples
-===============
+QMP Examples
+============
This section provides some examples of real QMP usage, in all of them
-"C" stands for "Client" and "S" stands for "Server".
+``->`` marks text sent by the Client and ``<-`` marks replies by the Server.
-3.1 Server greeting
--------------------
+.. admonition:: Example
-S: { "QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 3},
- "package": "v3.0.0"}, "capabilities": ["oob"] } }
+ Server greeting
-3.2 Capabilities negotiation
-----------------------------
+ .. code-block:: QMP
-C: { "execute": "qmp_capabilities", "arguments": { "enable": ["oob"] } }
-S: { "return": {}}
+ <- { "QMP": {"version": {"qemu": {"micro": 0, "minor": 0, "major": 3},
+ "package": "v3.0.0"}, "capabilities": ["oob"] } }
-3.3 Simple 'stop' execution
----------------------------
+.. admonition:: Example
-C: { "execute": "stop" }
-S: { "return": {} }
+ Capabilities negotiation
-3.4 KVM information
--------------------
+ .. code-block:: QMP
-C: { "execute": "query-kvm", "id": "example" }
-S: { "return": { "enabled": true, "present": true }, "id": "example"}
+ -> { "execute": "qmp_capabilities", "arguments": { "enable": ["oob"] } }
+ <- { "return": {}}
-3.5 Parsing error
-------------------
+.. admonition:: Example
-C: { "execute": }
-S: { "error": { "class": "GenericError", "desc": "Invalid JSON syntax" } }
+ Simple 'stop' execution
-3.6 Powerdown event
--------------------
+ .. code-block:: QMP
-S: { "timestamp": { "seconds": 1258551470, "microseconds": 802384 },
- "event": "POWERDOWN" }
+ -> { "execute": "stop" }
+ <- { "return": {} }
-3.7 Out-of-band execution
--------------------------
+.. admonition:: Example
-C: { "exec-oob": "migrate-pause", "id": 42 }
-S: { "id": 42,
- "error": { "class": "GenericError",
- "desc": "migrate-pause is currently only supported during postcopy-active state" } }
+ KVM information
+ .. code-block:: QMP
-4. Capabilities Negotiation
-===========================
+ -> { "execute": "query-kvm", "id": "example" }
+ <- { "return": { "enabled": true, "present": true }, "id": "example"}
+
+.. admonition:: Example
+
+ Parsing error
+
+ .. code-block:: QMP
+
+ -> { "execute": }
+ <- { "error": { "class": "GenericError", "desc": "JSON parse error, expecting value" } }
+
+.. admonition:: Example
+
+ Powerdown event
+
+ .. code-block:: QMP
+
+ <- { "timestamp": { "seconds": 1258551470, "microseconds": 802384 },
+ "event": "POWERDOWN" }
+
+.. admonition:: Example
+
+ Out-of-band execution
+
+ .. code-block:: QMP
+
+ -> { "exec-oob": "migrate-pause", "id": 42 }
+ <- { "id": 42,
+ "error": { "class": "GenericError",
+ "desc": "migrate-pause is currently only supported during postcopy-active state" } }
+
+
+Capabilities Negotiation
+========================
When a Client successfully establishes a connection, the Server is in
Capabilities Negotiation mode.
-In this mode only the qmp_capabilities command is allowed to run, all
-other commands will return the CommandNotFound error. Asynchronous
+In this mode only the ``qmp_capabilities`` command is allowed to run; all
+other commands will return the ``CommandNotFound`` error. Asynchronous
messages are not delivered either.
-Clients should use the qmp_capabilities command to enable capabilities
-advertised in the Server's greeting (section '2.2 Server Greeting') they
-support.
+Clients should use the ``qmp_capabilities`` command to enable capabilities
+advertised in the `Server Greeting`_ which they support.
-When the qmp_capabilities command is issued, and if it does not return an
-error, the Server enters in Command mode where capabilities changes take
-effect, all commands (except qmp_capabilities) are allowed and asynchronous
+When the ``qmp_capabilities`` command is issued, and if it does not return an
+error, the Server enters Command mode where capabilities changes take
+effect, all commands (except ``qmp_capabilities``) are allowed and asynchronous
messages are delivered.
-5 Compatibility Considerations
-==============================
+Compatibility Considerations
+============================
All protocol changes or new features which modify the protocol format in an
incompatible way are disabled by default and will be advertised by the
-capabilities array (section '2.2 Server Greeting'). Thus, Clients can check
+capabilities array (in the `Server Greeting`_). Thus, Clients can check
that array and enable the capabilities they support.
The QMP Server performs a type check on the arguments to a command. It
@@ -337,12 +368,12 @@ However, Clients must not assume any particular:
- Length of json-arrays
- Size of json-objects; in particular, future versions of QEMU may add
- new keys and Clients should be able to ignore them.
+ new keys and Clients should be able to ignore them
- Order of json-object members or json-array elements
- Amount of errors generated by a command, that is, new errors can be added
to any existing command in newer versions of the Server
-Any command or member name beginning with "x-" is deemed experimental,
+Any command or member name beginning with ``x-`` is deemed experimental,
and may be withdrawn or changed in an incompatible manner in a future
release.
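The compatibility rules above, combined with the message formats from the earlier sections of this specification, could be sketched on the client side roughly as follows. This is a minimal illustration only: ``build_command`` and ``classify`` are hypothetical helper names, not part of QEMU or of any QMP library, and the sketch assumes each server message has already been read as one complete JSON text.

```python
import itertools
import json

# Monotonic counter: a json-number incremented per command works as "id".
_next_id = itertools.count(1)


def build_command(name, arguments=None, oob=False):
    """Serialize one QMP command, always attaching an "id" for matching."""
    msg = {"exec-oob" if oob else "execute": name, "id": next(_next_id)}
    if arguments:  # "arguments" is optional when none are required
        msg["arguments"] = arguments
    return json.dumps(msg)


def classify(line):
    """Return (kind, payload, id) for one server message.

    Only the members the client understands are inspected; any other
    members are ignored, so newer servers can add keys safely.
    """
    msg = json.loads(line)
    if "QMP" in msg:
        return "greeting", msg["QMP"], None
    if "event" in msg:
        return "event", msg, None
    if "error" in msg:
        # "id" may be missing if the error occurred before it was read.
        return "error", msg["error"], msg.get("id")
    if "return" in msg:
        return "return", msg["return"], msg.get("id")
    return "unknown", msg, msg.get("id")
```

A client built this way never depends on member order or on the absence of extra keys, which is what the list above asks for.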
@@ -350,8 +381,8 @@ Of course, the Server does guarantee to send valid JSON. But apart from
this, a Client should be "conservative in what they send, and liberal in
what they accept".
-6. Downstream extension of QMP
-==============================
+Downstream extension of QMP
+===========================
We recommend that downstream consumers of QEMU do *not* modify QMP.
Management tools should be able to support both upstream and downstream
@@ -363,23 +394,25 @@ avoid modifying QMP. Both upstream and downstream need to take care to
preserve long-term compatibility and interoperability.
To help with that, QMP reserves JSON object member names beginning with
-'__' (double underscore) for downstream use ("downstream names"). This
+``__`` (double underscore) for downstream use ("downstream names"). This
means upstream will never use any downstream names for its commands,
arguments, errors, asynchronous events, and so forth.
-Any new names downstream wishes to add must begin with '__'. To
+Any new names downstream wishes to add must begin with ``__``. To
ensure compatibility with other downstreams, it is strongly
-recommended that you prefix your downstream names with '__RFQDN_' where
+recommended that you prefix your downstream names with ``__RFQDN_`` where
RFQDN is a valid, reverse fully qualified domain name which you
control. For example, a qemu-kvm specific monitor command would be:
+::
+
(qemu) __org.linux-kvm_enable_irqchip
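The naming rule above can be expressed as a small helper. The following is a sketch only; ``downstream_name`` and ``is_downstream_name`` are hypothetical names chosen for illustration and do not exist in QEMU.

```python
def downstream_name(rfqdn, name):
    """Build a downstream name of the form __RFQDN_name."""
    if not rfqdn or not name:
        raise ValueError("both the domain and the name are required")
    return f"__{rfqdn}_{name}"


def is_downstream_name(name):
    """True if the name lies in the namespace reserved for downstreams."""
    return name.startswith("__")
```

For instance, ``downstream_name("org.linux-kvm", "enable_irqchip")`` yields the command name shown in the example above.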
-Downstream must not change the server greeting (section 2.2) other than
+Downstream must not change the `server greeting`_ other than
to offer additional capabilities. But see below for why even that is
discouraged.
-Section '5 Compatibility Considerations' applies to downstream as well
+The section `Compatibility Considerations`_ applies to downstream as well
as to upstream, obviously. It follows that downstream must behave
exactly like upstream for any input not containing members with
downstream names ("downstream members"), except it may add members
diff --git a/docs/interop/vhost-user-gpu.rst b/docs/interop/vhost-user-gpu.rst
index 71a2c52b31..3035822d05 100644
--- a/docs/interop/vhost-user-gpu.rst
+++ b/docs/interop/vhost-user-gpu.rst
@@ -13,10 +13,10 @@ Introduction
============
The vhost-user-gpu protocol is aiming at sharing the rendering result
-of a virtio-gpu, done from a vhost-user slave process to a vhost-user
-master process (such as QEMU). It bears a resemblance to a display
+of a virtio-gpu, done from a vhost-user back-end process to a vhost-user
+front-end process (such as QEMU). It bears a resemblance to a display
server protocol, if you consider QEMU as the display server and the
-slave as the client, but in a very limited way. Typically, it will
+back-end as the client, but in a very limited way. Typically, it will
work by setting a scanout/display configuration, before sending flush
events for the display updates. It will also update the cursor shape
and position.
@@ -26,8 +26,8 @@ socket ancillary data to share opened file descriptors (DMABUF fds or
shared memory). The socket is usually obtained via
``VHOST_USER_GPU_SET_SOCKET``.
-Requests are sent by the *slave*, and the optional replies by the
-*master*.
+Requests are sent by the *back-end*, and the optional replies by the
+*front-end*.
Wire format
===========
@@ -124,6 +124,29 @@ VhostUserGpuDMABUFScanout
:fourcc: ``i32``, the DMABUF fourcc
+VhostUserGpuEdidRequest
+^^^^^^^^^^^^^^^^^^^^^^^
+
++------------+
+| scanout-id |
++------------+
+
+:scanout-id: ``u32``, the scanout to get the EDID from
+
+
+VhostUserGpuDMABUFScanout2
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++----------------+----------+
+| dmabuf_scanout | modifier |
++----------------+----------+
+
+:dmabuf_scanout: ``VhostUserGpuDMABUFScanout``, filled as described in the
+ VhostUserGpuDMABUFScanout structure.
+
+:modifier: ``u64``, the DMABUF modifiers
+
+
C structure
-----------
@@ -141,6 +164,8 @@ In QEMU the vhost-user-gpu message is implemented with the following struct:
VhostUserGpuScanout scanout;
VhostUserGpuUpdate update;
VhostUserGpuDMABUFScanout dmabuf_scanout;
+ VhostUserGpuEdidRequest edid_req;
+ struct virtio_gpu_resp_edid resp_edid;
struct virtio_gpu_resp_display_info display_info;
uint64_t u64;
} payload;
@@ -149,10 +174,12 @@ In QEMU the vhost-user-gpu message is implemented with the following struct:
Protocol features
-----------------
-None yet.
+.. code:: c
+
+ #define VHOST_USER_GPU_PROTOCOL_F_EDID 0
+ #define VHOST_USER_GPU_PROTOCOL_F_DMABUF2 1
-As the protocol may need to evolve, new messages and communication
-changes are negotiated thanks to preliminary
+New messages and communication changes are negotiated thanks to the
``VHOST_USER_GPU_GET_PROTOCOL_FEATURES`` and
``VHOST_USER_GPU_SET_PROTOCOL_FEATURES`` requests.
@@ -241,3 +268,22 @@ Message types
Note: there is no data payload, since the scanout is shared thanks
to DMABUF, that must have been set previously with
``VHOST_USER_GPU_DMABUF_SCANOUT``.
+
+``VHOST_USER_GPU_GET_EDID``
+ :id: 11
+ :request payload: ``struct VhostUserGpuEdidRequest``
+ :reply payload: ``struct virtio_gpu_resp_edid`` (from virtio specification)
+
+ Retrieve the EDID data for a given scanout.
+ This message requires the ``VHOST_USER_GPU_PROTOCOL_F_EDID`` protocol
+ feature to be supported.
+
+``VHOST_USER_GPU_DMABUF_SCANOUT2``
+ :id: 12
+ :request payload: ``VhostUserGpuDMABUFScanout2``
+ :reply payload: N/A
+
+ Same as ``VHOST_USER_GPU_DMABUF_SCANOUT``, but also sends the dmabuf modifiers
+ appended to the message, which were not provided in the other message.
+ This message requires the ``VHOST_USER_GPU_PROTOCOL_F_DMABUF2`` protocol
+ feature to be supported.
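The feature gating described for the two new messages might be checked as in the sketch below. It assumes that the ``VHOST_USER_GPU_PROTOCOL_F_*`` values are bit positions in the 64-bit feature mask exchanged by the get/set protocol features requests; ``REQUIRED_FEATURE`` and ``may_send`` are hypothetical names used only for illustration.

```python
# Protocol feature bit positions and message ids, as listed in the document.
VHOST_USER_GPU_PROTOCOL_F_EDID = 0
VHOST_USER_GPU_PROTOCOL_F_DMABUF2 = 1

VHOST_USER_GPU_GET_EDID = 11
VHOST_USER_GPU_DMABUF_SCANOUT2 = 12

# Hypothetical lookup table: message id -> feature bit it depends on.
REQUIRED_FEATURE = {
    VHOST_USER_GPU_GET_EDID: VHOST_USER_GPU_PROTOCOL_F_EDID,
    VHOST_USER_GPU_DMABUF_SCANOUT2: VHOST_USER_GPU_PROTOCOL_F_DMABUF2,
}


def may_send(request_id, negotiated_features):
    """Return True if the message may be sent with the negotiated mask."""
    feature = REQUIRED_FEATURE.get(request_id)
    if feature is None:
        return True  # message is not gated by a protocol feature
    return bool(negotiated_features & (1 << feature))
```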
diff --git a/docs/interop/vhost-user.rst b/docs/interop/vhost-user.rst
index edc3ad84a3..d8419fd2f1 100644
--- a/docs/interop/vhost-user.rst
+++ b/docs/interop/vhost-user.rst
@@ -23,21 +23,41 @@ space process on the same host. It uses communication over a Unix
domain socket to share file descriptors in the ancillary data of the
message.
-The protocol defines 2 sides of the communication, *master* and
-*slave*. *Master* is the application that shares its virtqueues, in
-our case QEMU. *Slave* is the consumer of the virtqueues.
+The protocol defines 2 sides of the communication, *front-end* and
+*back-end*. The *front-end* is the application that shares its virtqueues, in
+our case QEMU. The *back-end* is the consumer of the virtqueues.
-In the current implementation QEMU is the *master*, and the *slave* is
-the external process consuming the virtio queues, for example a
+In the current implementation QEMU is the *front-end*, and the *back-end*
+is the external process consuming the virtio queues, for example a
software Ethernet switch running in user space, such as Snabbswitch,
-or a block device backend processing read & write to a virtual
-disk. In order to facilitate interoperability between various backend
+or a block device back-end processing read & write to a virtual
+disk. In order to facilitate interoperability between various back-end
implementations, it is recommended to follow the :ref:`Backend program
conventions <backend_conventions>`.
-*Master* and *slave* can be either a client (i.e. connecting) or
+The *front-end* and *back-end* can be either a client (i.e. connecting) or
server (listening) in the socket communication.
+Support for platforms other than Linux
+--------------------------------------
+
+While vhost-user was initially developed targeting Linux, nowadays it
+is supported on any platform that provides the following features:
+
+- A way for requesting shared memory represented by a file descriptor
+ so it can be passed over a UNIX domain socket and then mapped by the
+ other process.
+
+- AF_UNIX sockets with SCM_RIGHTS, so QEMU and the other process can
+ exchange messages through it, including ancillary data when needed.
+
+- Either eventfd or pipe/pipe2. On platforms where eventfd is not
+ available, QEMU will automatically fall back to pipe2 or, as a last
+ resort, pipe. Each file descriptor will be used for receiving or
+ sending events by reading or writing (respectively) an 8-byte value
+ on the corresponding descriptor. The 8-byte value itself has no meaning and
+ should not be interpreted.
+
Message Specification
=====================
@@ -57,7 +77,7 @@ Header
:flags: 32-bit bit field
- Lower 2 bits are the version (currently 0x01)
-- Bit 2 is the reply flag - needs to be sent on each reply from the slave
+- Bit 2 is the reply flag - needs to be sent on each reply from the back-end
- Bit 3 is the need_reply flag - see :ref:`REPLY_ACK <reply_ack>` for
details.
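The bit layout of the header ``flags`` field listed above can be decoded as in this sketch. The constant and function names are illustrative assumptions, not identifiers from QEMU.

```python
VERSION_MASK = 0x3          # lower 2 bits: protocol version
REPLY_FLAG = 1 << 2         # bit 2: set on each reply from the back-end
NEED_REPLY_FLAG = 1 << 3    # bit 3: need_reply, see REPLY_ACK


def decode_flags(flags):
    """Split the 32-bit header flags field into its documented parts."""
    return {
        "version": flags & VERSION_MASK,
        "reply": bool(flags & REPLY_FLAG),
        "need_reply": bool(flags & NEED_REPLY_FLAG),
    }
```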
@@ -88,12 +108,49 @@ A vring state description
:num: a 32-bit number
+A vring descriptor index for split virtqueues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++-------------+---------------------+
+| vring index | index in avail ring |
++-------------+---------------------+
+
+:vring index: 32-bit index of the respective virtqueue
+
+:index in avail ring: 32-bit value, of which currently only the lower 16
+ bits are used:
+
+ - Bits 0–15: Index of the next *Available Ring* descriptor that the
+ back-end will process. This is a free-running index that is not
+ wrapped by the ring size.
+ - Bits 16–31: Reserved (set to zero)
+
+Vring descriptor indices for packed virtqueues
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++-------------+--------------------+
+| vring index | descriptor indices |
++-------------+--------------------+
+
+:vring index: 32-bit index of the respective virtqueue
+
+:descriptor indices: 32-bit value:
+
+ - Bits 0–14: Index of the next *Available Ring* descriptor that the
+ back-end will process. This is a free-running index that is not
+ wrapped by the ring size.
+ - Bit 15: Driver (Available) Ring Wrap Counter
+ - Bits 16–30: Index of the entry in the *Used Ring* where the back-end
+ will place the next descriptor. This is a free-running index that
+ is not wrapped by the ring size.
+ - Bit 31: Device (Used) Ring Wrap Counter
+
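As an illustration of the packed-virtqueue layout above, the 32-bit ``descriptor indices`` value can be split and reassembled as follows. This is a sketch under the bit assignments listed in the table; the function and key names are hypothetical.

```python
def decode_packed_indices(value):
    """Split the 32-bit descriptor indices value into its four fields."""
    return {
        "avail_index": value & 0x7FFF,          # bits 0-14
        "avail_wrap": (value >> 15) & 1,        # bit 15
        "used_index": (value >> 16) & 0x7FFF,   # bits 16-30
        "used_wrap": (value >> 31) & 1,         # bit 31
    }


def encode_packed_indices(avail_index, avail_wrap, used_index, used_wrap):
    """Inverse of decode_packed_indices; inputs are masked to field width."""
    return (
        (avail_index & 0x7FFF)
        | ((avail_wrap & 1) << 15)
        | ((used_index & 0x7FFF) << 16)
        | ((used_wrap & 1) << 31)
    )
```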
A vring address description
^^^^^^^^^^^^^^^^^^^^^^^^^^^
-+-------+-------+------+------------+------+-----------+-----+
-| index | flags | size | descriptor | used | available | log |
-+-------+-------+------+------------+------+-----------+-----+
++-------+-------+------------+------+-----------+-----+
+| index | flags | descriptor | used | available | log |
++-------+-------+------------+------+-----------+-----+
:index: a 32-bit vring index
@@ -110,18 +167,8 @@ A vring address description
Note that a ring address is an IOVA if ``VIRTIO_F_IOMMU_PLATFORM`` has
been negotiated. Otherwise it is a user address.
-Memory regions description
-^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-+-------------+---------+---------+-----+---------+
-| num regions | padding | region0 | ... | region7 |
-+-------------+---------+---------+-----+---------+
-
-:num regions: a 32-bit number of regions
-
-:padding: 32-bit
-
-A region is:
+Memory region description
+^^^^^^^^^^^^^^^^^^^^^^^^^
+---------------+------+--------------+-------------+
| guest address | size | user address | mmap offset |
@@ -135,22 +182,49 @@ A region is:
:mmap offset: 64-bit offset where region starts in the mapped memory
+When the ``VHOST_USER_PROTOCOL_F_XEN_MMAP`` protocol feature has been
+successfully negotiated, the memory region description contains two extra
+fields at the end.
+
++---------------+------+--------------+-------------+----------------+-------+
+| guest address | size | user address | mmap offset | xen mmap flags | domid |
++---------------+------+--------------+-------------+----------------+-------+
+
+:xen mmap flags: 32-bit bit field
+
+- Bit 0 is set for Xen foreign memory mapping.
+- Bit 1 is set for Xen grant memory mapping.
+- Bit 8 is set if the memory region can not be mapped in advance, and memory
+ areas within this region must be mapped / unmapped only when required by the
+ back-end. The back-end shouldn't try to map the entire region at once, as the
+ front-end may not allow it. The back-end should rather map only the required
+ amount of memory at once and unmap it after it is used.
+
+:domid: a 32-bit Xen hypervisor specific domain id.
+
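A back-end might parse the extended region description along these lines. The sketch assumes the four address/size fields are 64-bit, that the fields are laid out in native byte order without padding, and that the buffer holds exactly one region; ``XEN_REGION``, the flag constant names and ``parse_xen_region`` are hypothetical.

```python
import struct

# Assumed layout: four 64-bit fields, then xen mmap flags and domid (32-bit).
XEN_REGION = struct.Struct("=QQQQII")

XEN_MMAP_FOREIGN = 1 << 0    # bit 0: Xen foreign memory mapping
XEN_MMAP_GRANT = 1 << 1      # bit 1: Xen grant memory mapping
XEN_MMAP_ON_DEMAND = 1 << 8  # bit 8: region cannot be mapped in advance


def parse_xen_region(buf):
    """Unpack one memory region description with the two Xen fields."""
    guest, size, user, offset, flags, domid = XEN_REGION.unpack(buf)
    return {
        "guest_address": guest,
        "size": size,
        "user_address": user,
        "mmap_offset": offset,
        "foreign": bool(flags & XEN_MMAP_FOREIGN),
        "grant": bool(flags & XEN_MMAP_GRANT),
        "on_demand": bool(flags & XEN_MMAP_ON_DEMAND),
        "domid": domid,
    }
```

When ``on_demand`` is set, the back-end should map only the area it currently needs and unmap it afterwards, as the description of bit 8 requires.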
Single memory region description
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-+---------+---------------+------+--------------+-------------+
-| padding | guest address | size | user address | mmap offset |
-+---------+---------------+------+--------------+-------------+
++---------+--------+
+| padding | region |
++---------+--------+
:padding: 64-bit
-:guest address: a 64-bit guest address of the region
+The region is represented by a Memory region description.
-:size: a 64-bit size
+Multiple Memory regions description
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-:user address: a 64-bit user address
++-------------+---------+---------+-----+---------+
+| num regions | padding | region0 | ... | region7 |
++-------------+---------+---------+-----+---------+
-:mmap offset: 64-bit offset where region starts in the mapped memory
+:num regions: a 32-bit number of regions
+
+:padding: 32-bit
+
+Each region is represented by a Memory region description.
Log description
^^^^^^^^^^^^^^^
@@ -202,8 +276,8 @@ Virtio device config space
:size: a 32-bit configuration space access size in bytes
:flags: a 32-bit value:
- - 0: Vhost master messages used for writeable fields
- - 1: Vhost master messages used for live migration
+ - 0: Vhost front-end messages used for writable fields
+ - 1: Vhost front-end messages used for live migration
:payload: Size bytes array holding the contents of the virtio
device's configuration space
@@ -238,6 +312,42 @@ Inflight description
:queue size: a 16-bit size of virtqueues
+VhostUserShared
+^^^^^^^^^^^^^^^
+
++------+
+| UUID |
++------+
+
+:UUID: 16 bytes UUID, whose first three components (a 32-bit value, then
+ two 16-bit values) are stored in big endian.
+
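The byte order described for ``VhostUserShared`` matches the big-endian field layout that Python's standard ``uuid`` module uses for its ``bytes`` form, so a payload could be decoded as in this sketch. ``parse_shared_uuid`` is a hypothetical helper name.

```python
import uuid


def parse_shared_uuid(payload):
    """Interpret a 16-byte VhostUserShared payload as a UUID.

    uuid.UUID(bytes=...) reads the first three components (32-bit,
    16-bit, 16-bit) in big-endian order, as the format above specifies.
    """
    if len(payload) != 16:
        raise ValueError("VhostUserShared payload must be 16 bytes")
    return uuid.UUID(bytes=bytes(payload))
```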
+Device state transfer parameters
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
++--------------------+-----------------+
+| transfer direction | migration phase |
++--------------------+-----------------+
+
+:transfer direction: a 32-bit enum, describing the direction in which
+ the state is transferred:
+
+ - 0: Save: Transfer the state from the back-end to the front-end,
+ which happens on the source side of migration
+ - 1: Load: Transfer the state from the front-end to the back-end,
+ which happens on the destination side of migration
+
+:migration phase: a 32-bit enum, describing the state in which the VM
+ guest and devices are:
+
+ - 0: Stopped (in the period after the transfer of memory-mapped
+ regions before switch-over to the destination): The VM guest is
+ stopped, and the vhost-user device is suspended (see
+ :ref:`Suspended device state <suspended_device_state>`).
+
+ In the future, additional phases might be added e.g. to allow
+ iterative migration while the device is running.
+
C structure
-----------
@@ -270,8 +380,8 @@ vhost for the Linux Kernel. Most messages that can be sent via the
Unix domain socket implementing vhost-user have an equivalent ioctl to
the kernel implementation.
-The communication consists of *master* sending message requests and
-*slave* sending message replies. Most of the requests don't require
+The communication consists of the *front-end* sending message requests and
+the *back-end* sending message replies. Most of the requests don't require
replies. Here is a list of the ones that do:
* ``VHOST_USER_GET_FEATURES``
@@ -285,112 +395,137 @@ replies. Here is a list of the ones that do:
:ref:`REPLY_ACK <reply_ack>`
The section on ``REPLY_ACK`` protocol extension.
-There are several messages that the master sends with file descriptors passed
+There are several messages that the front-end sends with file descriptors passed
in the ancillary data:
+* ``VHOST_USER_ADD_MEM_REG``
* ``VHOST_USER_SET_MEM_TABLE``
* ``VHOST_USER_SET_LOG_BASE`` (if ``VHOST_USER_PROTOCOL_F_LOG_SHMFD``)
* ``VHOST_USER_SET_LOG_FD``
* ``VHOST_USER_SET_VRING_KICK``
* ``VHOST_USER_SET_VRING_CALL``
* ``VHOST_USER_SET_VRING_ERR``
-* ``VHOST_USER_SET_SLAVE_REQ_FD``
+* ``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
* ``VHOST_USER_SET_INFLIGHT_FD`` (if ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD``)
+* ``VHOST_USER_SET_DEVICE_STATE_FD``
-If *master* is unable to send the full message or receives a wrong
+If the *front-end* is unable to send the full message or receives a wrong
reply it will close the connection. An optional reconnection mechanism
can be implemented.
-If *slave* detects some error such as incompatible features, it may also
+If the *back-end* detects an error such as incompatible features, it may also
close the connection. This should only happen in exceptional circumstances.
Any protocol extensions are gated by protocol feature bits, which
-allows full backwards compatibility on both master and slave. As
-older slaves don't support negotiating protocol features, a feature
+allows full backwards compatibility on both front-end and back-end. As
+older back-ends don't support negotiating protocol features, a feature
bit was dedicated for this purpose::
#define VHOST_USER_F_PROTOCOL_FEATURES 30
-Starting and stopping rings
----------------------------
+Note that VHOST_USER_F_PROTOCOL_FEATURES is the UNUSED (30) feature
+bit defined in `VIRTIO 1.1 6.3 Legacy Interface: Reserved Feature Bits
+<https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-4130003>`_.
+VIRTIO devices do not advertise this feature bit and therefore VIRTIO
+drivers cannot negotiate it.
-Client must only process each ring when it is started.
+This reserved feature bit was reused by the vhost-user protocol to add
+vhost-user protocol feature negotiation in a backwards compatible
+fashion. Old vhost-user front-end and back-end implementations continue to
+work even though they are not aware of vhost-user protocol feature
+negotiation.
-Client must only pass data between the ring and the backend, when the
-ring is enabled.
-
-If ring is started but disabled, client must process the ring without
-talking to the backend.
+Ring states
+-----------
-For example, for a networking device, in the disabled state client
-must not supply any new RX packets, but must process and discard any
-TX packets.
+Rings have two independent states: started/stopped, and enabled/disabled.
-If ``VHOST_USER_F_PROTOCOL_FEATURES`` has not been negotiated, the
-ring is initialized in an enabled state.
+* While a ring is stopped, the back-end must not process the ring at
+ all, regardless of whether it is enabled or disabled. The
+ enabled/disabled state should still be tracked, though, so it can come
+ into effect once the ring is started.
-If ``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, the ring is
-initialized in a disabled state. Client must not pass data to/from the
-backend until ring is enabled by ``VHOST_USER_SET_VRING_ENABLE`` with
-parameter 1, or after it has been disabled by
-``VHOST_USER_SET_VRING_ENABLE`` with parameter 0.
+* started and disabled: The back-end must process the ring without
+ causing any side effects. For example, for a networking device,
+ in the disabled state the back-end must not supply any new RX packets,
+ but must process and discard any TX packets.
-Each ring is initialized in a stopped state, client must not process
-it until ring is started, or after it has been stopped.
+* started and enabled: The back-end must process the ring normally, i.e.
+ process all requests and execute them.
-Client must start ring upon receiving a kick (that is, detecting that
-file descriptor is readable) on the descriptor specified by
+Each ring is initialized in a stopped and disabled state. The back-end
+must start a ring upon receiving a kick (that is, detecting that the file
+descriptor is readable) on the descriptor specified by
``VHOST_USER_SET_VRING_KICK`` or receiving the in-band message
-``VHOST_USER_VRING_KICK`` if negotiated, and stop ring upon receiving
+``VHOST_USER_VRING_KICK`` if negotiated, and stop a ring upon receiving
``VHOST_USER_GET_VRING_BASE``.
-While processing the rings (whether they are enabled or not), client
+Rings can be enabled or disabled by ``VHOST_USER_SET_VRING_ENABLE``.
+
+In addition, upon receiving a ``VHOST_USER_SET_FEATURES`` message from
+the front-end without ``VHOST_USER_F_PROTOCOL_FEATURES`` set, the
+back-end must enable all rings immediately.
+
+While processing the rings (whether they are enabled or not), the back-end
must support changing some configuration aspects on the fly.
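The ring-state rules above can be sketched in C as follows. This is only an illustration: ``struct vu_ring``, ``vu_ring_decide``, ``vu_on_set_features`` and the ``VU_RING_*`` enumerators are hypothetical names that do not come from any vhost-user implementation; only the feature bit number is taken from this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VHOST_USER_F_PROTOCOL_FEATURES 30

/* Hypothetical per-ring bookkeeping; the two flags are independent. */
struct vu_ring {
    bool started;   /* set on kick, cleared by VHOST_USER_GET_VRING_BASE */
    bool enabled;   /* changed by VHOST_USER_SET_VRING_ENABLE */
};

enum vu_ring_mode {
    VU_RING_IGNORE,          /* stopped: do not touch the ring at all */
    VU_RING_PROCESS_DISCARD, /* started + disabled: process, no side effects */
    VU_RING_PROCESS_NORMAL   /* started + enabled: process and execute */
};

/* Decide how the back-end should treat a ring in its current state. */
enum vu_ring_mode vu_ring_decide(const struct vu_ring *ring)
{
    assert(ring != NULL);
    if (!ring->started) {
        return VU_RING_IGNORE;
    }
    return ring->enabled ? VU_RING_PROCESS_NORMAL : VU_RING_PROCESS_DISCARD;
}

/*
 * VHOST_USER_SET_FEATURES without the protocol-features bit enables all
 * rings immediately; the started/stopped state is left untouched.
 */
void vu_on_set_features(struct vu_ring *rings, size_t n, uint64_t features)
{
    if (!(features & (UINT64_C(1) << VHOST_USER_F_PROTOCOL_FEATURES))) {
        for (size_t i = 0; i < n; i++) {
            rings[i].enabled = true;
        }
    }
}
```

Keeping ``enabled`` up to date while a ring is stopped is what lets the state "come into effect once the ring is started" without any extra bookkeeping.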
+.. _suspended_device_state:
+
+Suspended device state
+^^^^^^^^^^^^^^^^^^^^^^
+
+While all vrings are stopped, the device is *suspended*. In addition to
+not processing any vring (because they are stopped), the device must:
+
+* not write to any guest memory regions,
+* not send any notifications to the guest,
+* not send any messages to the front-end,
+* still process and reply to messages from the front-end.
+
Multiple queue support
----------------------
-Many devices have a fixed number of virtqueues. In this case the master
+Many devices have a fixed number of virtqueues. In this case the front-end
already knows the number of available virtqueues without communicating with the
-slave.
+back-end.
Some devices do not have a fixed number of virtqueues. Instead the maximum
-number of virtqueues is chosen by the slave. The number can depend on host
-resource availability or slave implementation details. Such devices are called
+number of virtqueues is chosen by the back-end. The number can depend on host
+resource availability or back-end implementation details. Such devices are called
multiple queue devices.
-Multiple queue support allows the slave to advertise the maximum number of
-queues. This is treated as a protocol extension, hence the slave has to
+Multiple queue support allows the back-end to advertise the maximum number of
+queues. This is treated as a protocol extension, hence the back-end has to
implement protocol features first. The multiple queues feature is supported
only when the protocol feature ``VHOST_USER_PROTOCOL_F_MQ`` (bit 0) is set.
-The max number of queues the slave supports can be queried with message
-``VHOST_USER_GET_QUEUE_NUM``. Master should stop when the number of requested
+The max number of queues the back-end supports can be queried with message
+``VHOST_USER_GET_QUEUE_NUM``. The front-end should stop when the number of requested
queues is bigger than that.
-As all queues share one connection, the master uses a unique index for each
+As all queues share one connection, the front-end uses a unique index for each
queue in the sent message to identify a specified queue.
-The master enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
+The front-end enables queues by sending message ``VHOST_USER_SET_VRING_ENABLE``.
vhost-user-net has historically automatically enabled the first queue pair.
-Slaves should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
+Back-ends should always implement the ``VHOST_USER_PROTOCOL_F_MQ`` protocol
feature, even for devices with a fixed number of virtqueues, since it is simple
to implement and offers a degree of introspection.
-Masters must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
+Front-ends must not rely on the ``VHOST_USER_PROTOCOL_F_MQ`` protocol feature for
devices with a fixed number of virtqueues. Only true multiqueue devices
require this protocol feature.
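As a rough illustration of the queue-count check described in this section, a front-end might proceed as sketched below. The function name and its parameters are hypothetical; ``backend_max`` stands for the ``VHOST_USER_GET_QUEUE_NUM`` reply and ``fixed_count`` for the number of virtqueues the front-end already knows for a fixed-queue device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_PROTOCOL_F_MQ 0

/*
 * Returns true if 'requested' queues may be set up.
 * With VHOST_USER_PROTOCOL_F_MQ the limit is the back-end's reported maximum;
 * without it the front-end falls back to the device's fixed queue count and
 * does not rely on the protocol feature.
 */
bool vu_queue_count_ok(uint64_t protocol_features, uint64_t backend_max,
                       uint64_t fixed_count, uint64_t requested)
{
    bool has_mq = (protocol_features &
                   (UINT64_C(1) << VHOST_USER_PROTOCOL_F_MQ)) != 0;
    uint64_t limit = has_mq ? backend_max : fixed_count;

    assert(requested > 0); /* a device without queues is not meaningful */
    return requested <= limit;
}
```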
Migration
---------
-During live migration, the master may need to track the modifications
-the slave makes to the memory mapped regions. The client should mark
+During live migration, the front-end may need to track the modifications
+the back-end makes to the memory mapped regions. The back-end should mark
the dirty pages in a log. Once it complies with this logging, it may
declare the ``VHOST_F_LOG_ALL`` vhost feature.
-To start/stop logging of data/used ring writes, server may send
+To start/stop logging of data/used ring writes, the front-end may send
messages ``VHOST_USER_SET_FEATURES`` with ``VHOST_F_LOG_ALL`` and
``VHOST_USER_SET_VRING_ADDR`` with ``VHOST_VRING_F_LOG`` in ring's
flags set to 1/0, respectively.
@@ -404,7 +539,7 @@ Dirty pages are of size::
#define VHOST_LOG_PAGE 0x1000
The log memory fd is provided in the ancillary data of
-``VHOST_USER_SET_LOG_BASE`` message when the slave has
+``VHOST_USER_SET_LOG_BASE`` message when the back-end has
``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature.
The size of the log is supplied as part of ``VhostUserMsg`` which
@@ -430,26 +565,101 @@ the bit offset of the last byte of the ring must fall within the size
supplied by ``VhostUserLog``.
``VHOST_USER_SET_LOG_FD`` is an optional message with an eventfd in
-ancillary data, it may be used to inform the master that the log has
+ancillary data; it may be used to inform the front-end that the log has
been modified.
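A minimal sketch of how a back-end could mark dirty pages in the shared log. It assumes a one-bit-per-page bitmap indexed by guest address divided by ``VHOST_LOG_PAGE``, with the least significant bit of each byte first; that layout is an assumption here because this excerpt does not spell it out. The function name ``vu_log_write`` is hypothetical. Atomic updates are used because the log may be read or cleared concurrently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define VHOST_LOG_PAGE 0x1000

/*
 * Mark every VHOST_LOG_PAGE-sized page touched by a write of 'len' bytes
 * at guest address 'addr'. Pages beyond 'log_size' bytes of log are skipped.
 */
void vu_log_write(_Atomic uint8_t *log, size_t log_size,
                  uint64_t addr, uint64_t len)
{
    assert(log != NULL);
    if (len == 0) {
        return;
    }

    uint64_t first = addr / VHOST_LOG_PAGE;
    uint64_t last = (addr + len - 1) / VHOST_LOG_PAGE;

    for (uint64_t page = first; page <= last; page++) {
        if (page / 8 >= log_size) {
            break; /* outside the log supplied by the front-end */
        }
        atomic_fetch_or(&log[page / 8], (uint8_t)(1u << (page % 8)));
    }
}
```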
Once the source has finished migration, rings will be stopped by the
-source. No further update must be done before rings are restarted.
+source (:ref:`Suspended device state <suspended_device_state>`). No
+further update must be done before rings are restarted.
-In postcopy migration the slave is started before all the memory has
+In postcopy migration the back-end is started before all the memory has
been received from the source host, and care must be taken to avoid
-accessing pages that have yet to be received. The slave opens a
+accessing pages that have yet to be received. The back-end opens a
'userfault'-fd and registers the memory with it; this fd is then
-passed back over to the master. The master services requests on the
+passed back over to the front-end. The front-end services requests on the
userfaultfd for pages that are accessed and when the page is available
it performs WAKE ioctl's on the userfaultfd to wake the stalled
-slave. The client indicates support for this via the
+back-end. The back-end indicates support for this via the
``VHOST_USER_PROTOCOL_F_PAGEFAULT`` feature.
+.. _migrating_backend_state:
+
+Migrating back-end state
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+Migrating device state involves transferring the state from one
+back-end, called the source, to another back-end, called the
+destination. After migration, the destination transparently resumes
+operation without requiring the driver to re-initialize the device at
+the VIRTIO level. If the migration fails, then the source can
+transparently resume operation until another migration attempt is made.
+
+Generally, the front-end is connected to a virtual machine guest (which
+contains the driver), which has its own state to transfer between source
+and destination, and therefore will have an implementation-specific
+mechanism to do so. The ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature
+provides functionality to have the front-end include the back-end's
+state in this transfer operation so the back-end does not need to
+implement its own mechanism, and so the virtual machine may have its
+complete state, including vhost-user devices' states, contained within a
+single stream of data.
+
+To do this, the back-end state is transferred from back-end to front-end
+on the source side, and vice versa on the destination side. This
+transfer happens over a channel that is negotiated using the
+``VHOST_USER_SET_DEVICE_STATE_FD`` message. This message has two
+parameters:
+
+* Direction of transfer: On the source, the data is saved, transferring
+ it from the back-end to the front-end. On the destination, the data
+ is loaded, transferring it from the front-end to the back-end.
+
+* Migration phase: Currently, the only supported phase is the period
+ after the transfer of memory-mapped regions before switch-over to the
+ destination, when both the source and destination devices are
+ suspended (:ref:`Suspended device state <suspended_device_state>`).
+ In the future, additional phases might be supported to allow iterative
+ migration while the device is running.
+
+The nature of the channel is implementation-defined, but it must
+generally behave like a pipe: The writing end will write all the data it
+has into it, signalling the end of data by closing its end. The reading
+end must read all of this data (until encountering the end of file) and
+process it.
+
+* When saving, the writing end is the source back-end, and the reading
+ end is the source front-end. After reading the state data from the
+ channel, the source front-end must transfer it to the destination
+ front-end through an implementation-defined mechanism.
+
+* When loading, the writing end is the destination front-end, and the
+ reading end is the destination back-end. After reading the state data
+ from the channel, the destination back-end must deserialize its
+ internal state from that data and set itself up to allow the driver to
+ seamlessly resume operation on the VIRTIO level.
+
+Seamlessly resuming operation means that the migration must be
+transparent to the guest driver, which operates on the VIRTIO level.
+This driver will not perform any re-initialization steps, but continue
+to use the device as if no migration had occurred. The vhost-user
+front-end, however, will re-initialize the vhost state on the
+destination, following the usual protocol for establishing a connection
+to a vhost-user back-end: This includes, for example, setting up memory
+mappings and kick and call FDs as necessary, negotiating protocol
+features, or setting the initial vring base indices (to the same value
+as on the source side, so that operation can resume).
+
+Both on the source and on the destination side, after the respective
+front-end has seen all data transferred (when the transfer FD has been
+closed), it sends the ``VHOST_USER_CHECK_DEVICE_STATE`` message to
+verify that data transfer was successful in the back-end, too. The
+back-end responds once it knows whether the transfer and processing was
+successful or not.
+
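The pipe-like channel semantics described above (read everything until end of file) can be sketched for the reading end as follows. ``vu_read_state`` is a hypothetical helper, not part of any vhost-user library; it only uses POSIX ``read()`` and the C allocator, and leaves deserialization of the state to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Read from 'fd' until the writing end closes it. On success returns a
 * malloc'd buffer (the caller frees it) and stores its length in *len;
 * returns NULL on allocation or read errors.
 */
uint8_t *vu_read_state(int fd, size_t *len)
{
    size_t cap = 4096, used = 0;
    uint8_t *buf;

    assert(len != NULL);
    buf = malloc(cap);
    if (buf == NULL) {
        return NULL;
    }

    for (;;) {
        if (used == cap) {
            uint8_t *tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
        }

        ssize_t n = read(fd, buf + used, cap - used);
        if (n == 0) {
            break; /* end of file: the writer has sent all of its data */
        }
        if (n < 0) {
            if (errno == EINTR) {
                continue;
            }
            free(buf);
            return NULL;
        }
        used += (size_t)n;
    }

    *len = used;
    return buf;
}
```

Only after this function returns would the front-end send ``VHOST_USER_CHECK_DEVICE_STATE``, since end of file is the only signal that the transfer is complete.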
Memory access
-------------
-The master sends a list of vhost memory regions to the slave using the
+The front-end sends a list of vhost memory regions to the back-end using the
``VHOST_USER_SET_MEM_TABLE`` message. Each region has two base
addresses: a guest address and a user address.
@@ -474,60 +684,60 @@ IOMMU support
-------------
When the ``VIRTIO_F_IOMMU_PLATFORM`` feature has been negotiated, the
-master sends IOTLB entries update & invalidation by sending
-``VHOST_USER_IOTLB_MSG`` requests to the slave with a ``struct
+front-end sends IOTLB entries update & invalidation by sending
+``VHOST_USER_IOTLB_MSG`` requests to the back-end with a ``struct
vhost_iotlb_msg`` as payload. For update events, the ``iotlb`` payload
has to be filled with the update message type (2), the I/O virtual
address, the size, the user virtual address, and the permissions
flags. Addresses and size must be within vhost memory regions set via
the ``VHOST_USER_SET_MEM_TABLE`` request. For invalidation events, the
``iotlb`` payload has to be filled with the invalidation message type
-(3), the I/O virtual address and the size. On success, the slave is
+(3), the I/O virtual address and the size. On success, the back-end is
expected to reply with a zero payload, non-zero otherwise.
-The slave relies on the slave communication channel (see :ref:`Slave
-communication <slave_communication>` section below) to send IOTLB miss
-and access failure events, by sending ``VHOST_USER_SLAVE_IOTLB_MSG``
-requests to the master with a ``struct vhost_iotlb_msg`` as
+The back-end relies on the back-end communication channel (see :ref:`Back-end
+communication <backend_communication>` section below) to send IOTLB miss
+and access failure events, by sending ``VHOST_USER_BACKEND_IOTLB_MSG``
+requests to the front-end with a ``struct vhost_iotlb_msg`` as
payload. For miss events, the iotlb payload has to be filled with the
miss message type (1), the I/O virtual address and the permissions
flags. For access failure event, the iotlb payload has to be filled
with the access failure message type (4), the I/O virtual address and
-the permissions flags. For synchronization purpose, the slave may
-rely on the reply-ack feature, so the master may send a reply when
+the permissions flags. For synchronization purposes, the back-end may
+rely on the reply-ack feature, so the front-end may send a reply when
operation is completed if the reply-ack feature is negotiated and
-slaves requests a reply. For miss events, completed operation means
-either master sent an update message containing the IOTLB entry
-containing requested address and permission, or master sent nothing if
+the back-end requests a reply. For miss events, completed operation means
+either that the front-end sent an update message containing the IOTLB entry
+covering the requested address and permission, or that the front-end sent nothing if
the IOTLB miss message is invalid (invalid IOVA or permission).
-The master isn't expected to take the initiative to send IOTLB update
-messages, as the slave sends IOTLB miss messages for the guest virtual
+The front-end isn't expected to take the initiative to send IOTLB update
+messages, as the back-end sends IOTLB miss messages for the guest virtual
memory areas it needs to access.
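The payloads described in this section can be sketched as below. The struct and helper names are local to this sketch; the field set mirrors what the text lists (I/O virtual address, size, user virtual address, permissions, type), and the type values are the ones given above. The real ``struct vhost_iotlb_msg`` definition remains the authoritative layout.

```c
#include <assert.h>
#include <stdint.h>

/* Message types as numbered in the text above. */
enum {
    IOTLB_MISS        = 1, /* back-end to front-end */
    IOTLB_UPDATE      = 2, /* front-end to back-end */
    IOTLB_INVALIDATE  = 3, /* front-end to back-end */
    IOTLB_ACCESS_FAIL = 4  /* back-end to front-end */
};

/* Illustrative mirror of the IOTLB payload fields. */
struct iotlb_msg {
    uint64_t iova;  /* I/O virtual address */
    uint64_t size;
    uint64_t uaddr; /* user virtual address */
    uint8_t perm;   /* permission flags */
    uint8_t type;
};

/* Update: all of iova, size, uaddr and perm are filled in. */
struct iotlb_msg iotlb_update(uint64_t iova, uint64_t size,
                              uint64_t uaddr, uint8_t perm)
{
    assert(size != 0);
    return (struct iotlb_msg){ .iova = iova, .size = size, .uaddr = uaddr,
                               .perm = perm, .type = IOTLB_UPDATE };
}

/* Invalidation: only iova and size are meaningful. */
struct iotlb_msg iotlb_invalidate(uint64_t iova, uint64_t size)
{
    assert(size != 0);
    return (struct iotlb_msg){ .iova = iova, .size = size,
                               .type = IOTLB_INVALIDATE };
}

/* Miss: only iova and perm are meaningful. */
struct iotlb_msg iotlb_miss(uint64_t iova, uint8_t perm)
{
    return (struct iotlb_msg){ .iova = iova, .perm = perm,
                               .type = IOTLB_MISS };
}
```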
-.. _slave_communication:
+.. _backend_communication:
-Slave communication
--------------------
+Back-end communication
+----------------------
-An optional communication channel is provided if the slave declares
-``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` protocol feature, to allow the
-slave to make requests to the master.
+An optional communication channel is provided if the back-end declares
+``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` protocol feature, to allow the
+back-end to make requests to the front-end.
-The fd is provided via ``VHOST_USER_SET_SLAVE_REQ_FD`` ancillary data.
+The fd is provided via ``VHOST_USER_SET_BACKEND_REQ_FD`` ancillary data.
-A slave may then send ``VHOST_USER_SLAVE_*`` messages to the master
+A back-end may then send ``VHOST_USER_BACKEND_*`` messages to the front-end
using this fd communication channel.
-If ``VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD`` protocol feature is
-negotiated, slave can send file descriptors (at most 8 descriptors in
-each message) to master via ancillary data using this fd communication
+If ``VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD`` protocol feature is
+negotiated, the back-end can send file descriptors (at most 8 descriptors in
+each message) to the front-end via ancillary data using this fd communication
channel.
Inflight I/O tracking
---------------------
-To support reconnecting after restart or crash, slave may need to
+To support reconnecting after restart or crash, the back-end may need to
resubmit inflight I/Os. If virtqueue is processed in order, we can
easily achieve that by getting the inflight descriptors from
descriptor table (split virtqueue) or descriptor ring (packed
@@ -535,18 +745,18 @@ virtqueue). However, it can't work when we process descriptors
out-of-order because some entries which store the information of
inflight descriptors in available ring (split virtqueue) or descriptor
ring (packed virtqueue) might be overridden by new entries. To solve
-this problem, slave need to allocate an extra buffer to store this
-information of inflight descriptors and share it with master for
+this problem, the back-end needs to allocate an extra buffer to store this
+information of inflight descriptors and share it with the front-end for
persistence. ``VHOST_USER_GET_INFLIGHT_FD`` and
``VHOST_USER_SET_INFLIGHT_FD`` are used to transfer this buffer
-between master and slave. And the format of this buffer is described
+between front-end and back-end. The format of this buffer is described
below:
+---------------+---------------+-----+---------------+
| queue0 region | queue1 region | ... | queueN region |
+---------------+---------------+-----+---------------+
-N is the number of available virtqueues. Slave could get it from num
+N is the number of available virtqueues. The back-end could get it from num
queues field of ``VhostUserInflight``.
For split virtqueue, queue region can be implemented as:
@@ -578,8 +788,8 @@ For split virtqueue, queue region can be implemented as:
* Zero value indicates an uninitialized buffer */
uint16_t version;
- /* The size of DescStateSplit array. It's equal to the virtqueue
- * size. Slave could get it from queue size field of VhostUserInflight. */
+ /* The size of DescStateSplit array. It's equal to the virtqueue size.
+ * The back-end could get it from queue size field of VhostUserInflight. */
uint16_t desc_num;
/* The head of list that track the last batch of used descriptors. */
@@ -685,8 +895,8 @@ For packed virtqueue, queue region can be implemented as:
* Zero value indicates an uninitialized buffer */
uint16_t version;
- /* The size of DescStatePacked array. It's equal to the virtqueue
- * size. Slave could get it from queue size field of VhostUserInflight. */
+ /* The size of DescStatePacked array. It's equal to the virtqueue size.
+ * The back-end could get it from queue size field of VhostUserInflight. */
uint16_t desc_num;
/* The head of free DescStatePacked entry list */
@@ -778,8 +988,8 @@ When reconnecting:
#. Use ``old_used_wrap_counter`` to calculate the available flags
#. If ``d.flags`` is not equal to the calculated flags value (means
- slave has submitted the buffer to guest driver before crash, so
- it has to commit the in-progres update), set ``old_free_head``,
+ back-end has submitted the buffer to guest driver before crash, so
+ it has to commit the in-progress update), set ``old_free_head``,
``old_used_idx``, ``old_used_wrap_counter`` to ``free_head``,
``used_idx``, ``used_wrap_counter``
@@ -806,12 +1016,12 @@ Note that due to the fact that too many messages on the sockets can
cause the sending application(s) to block, it is not advised to use
this feature unless absolutely necessary. It is also considered an
error to negotiate this feature without also negotiating
-``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``,
-the former is necessary for getting a message channel from the slave
-to the master, while the latter needs to be used with the in-band
+``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` and ``VHOST_USER_PROTOCOL_F_REPLY_ACK``,
+the former is necessary for getting a message channel from the back-end
+to the front-end, while the latter needs to be used with the in-band
notification messages to block until they are processed, both to avoid
blocking later and for proper processing (at least in the simulation
-use case.) As it has no other way of signalling this error, the slave
+use case.) As it has no other way of signalling this error, the back-end
should close the connection as a response to a
``VHOST_USER_SET_PROTOCOL_FEATURES`` message that sets the in-band
notifications feature flag without the other two.
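The dependency rule above can be checked in a few lines. The helper names are hypothetical; the bit numbers are those listed under "Protocol features" in this document.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VHOST_USER_PROTOCOL_F_REPLY_ACK             3
#define VHOST_USER_PROTOCOL_F_BACKEND_REQ           5
#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14

static uint64_t vu_bit(unsigned n)
{
    assert(n < 64);
    return UINT64_C(1) << n;
}

/*
 * Validate the bitmask received in VHOST_USER_SET_PROTOCOL_FEATURES.
 * Returns false when in-band notifications are requested without both
 * BACKEND_REQ and REPLY_ACK, in which case the back-end should close
 * the connection.
 */
bool vu_protocol_features_valid(uint64_t features)
{
    uint64_t required = vu_bit(VHOST_USER_PROTOCOL_F_BACKEND_REQ) |
                        vu_bit(VHOST_USER_PROTOCOL_F_REPLY_ACK);

    if (!(features & vu_bit(VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS))) {
        return true;
    }
    return (features & required) == required;
}
```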
@@ -826,108 +1036,117 @@ Protocol features
#define VHOST_USER_PROTOCOL_F_RARP 2
#define VHOST_USER_PROTOCOL_F_REPLY_ACK 3
#define VHOST_USER_PROTOCOL_F_MTU 4
- #define VHOST_USER_PROTOCOL_F_SLAVE_REQ 5
+ #define VHOST_USER_PROTOCOL_F_BACKEND_REQ 5
#define VHOST_USER_PROTOCOL_F_CROSS_ENDIAN 6
#define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION 7
#define VHOST_USER_PROTOCOL_F_PAGEFAULT 8
#define VHOST_USER_PROTOCOL_F_CONFIG 9
- #define VHOST_USER_PROTOCOL_F_SLAVE_SEND_FD 10
+ #define VHOST_USER_PROTOCOL_F_BACKEND_SEND_FD 10
#define VHOST_USER_PROTOCOL_F_HOST_NOTIFIER 11
#define VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD 12
#define VHOST_USER_PROTOCOL_F_RESET_DEVICE 13
#define VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS 14
#define VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS 15
#define VHOST_USER_PROTOCOL_F_STATUS 16
+ #define VHOST_USER_PROTOCOL_F_XEN_MMAP 17
+ #define VHOST_USER_PROTOCOL_F_SHARED_OBJECT 18
+ #define VHOST_USER_PROTOCOL_F_DEVICE_STATE 19
-Master message types
---------------------
+Front-end message types
+-----------------------
``VHOST_USER_GET_FEATURES``
:id: 1
:equivalent ioctl: ``VHOST_GET_FEATURES``
- :master payload: N/A
- :slave payload: ``u64``
+ :request payload: N/A
+ :reply payload: ``u64``
Get from the underlying vhost implementation the features bitmask.
- Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals slave support
+ Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals back-end support
for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
``VHOST_USER_SET_PROTOCOL_FEATURES``.
``VHOST_USER_SET_FEATURES``
:id: 2
:equivalent ioctl: ``VHOST_SET_FEATURES``
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
Enable features in the underlying vhost implementation using a
bitmask. Feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` signals
- slave support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
+ back-end support for ``VHOST_USER_GET_PROTOCOL_FEATURES`` and
``VHOST_USER_SET_PROTOCOL_FEATURES``.
``VHOST_USER_GET_PROTOCOL_FEATURES``
:id: 15
:equivalent ioctl: ``VHOST_GET_FEATURES``
- :master payload: N/A
- :slave payload: ``u64``
+ :request payload: N/A
+ :reply payload: ``u64``
Get the protocol feature bitmask from the underlying vhost
implementation. Only legal if feature bit
``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
- ``VHOST_USER_GET_FEATURES``.
+ ``VHOST_USER_GET_FEATURES``. It does not need to be acknowledged by
+ ``VHOST_USER_SET_FEATURES``.
.. Note::
- Slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must
+ Back-ends that report ``VHOST_USER_F_PROTOCOL_FEATURES`` must
support this message even before ``VHOST_USER_SET_FEATURES`` was
called.
``VHOST_USER_SET_PROTOCOL_FEATURES``
:id: 16
:equivalent ioctl: ``VHOST_SET_FEATURES``
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
Enable protocol features in the underlying vhost implementation.
Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is present in
- ``VHOST_USER_GET_FEATURES``.
+ ``VHOST_USER_GET_FEATURES``. It does not need to be acknowledged by
+ ``VHOST_USER_SET_FEATURES``.
.. Note::
- Slave that reported ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
+ Back-ends that report ``VHOST_USER_F_PROTOCOL_FEATURES`` must support
this message even before ``VHOST_USER_SET_FEATURES`` was called.
``VHOST_USER_SET_OWNER``
:id: 3
:equivalent ioctl: ``VHOST_SET_OWNER``
- :master payload: N/A
+ :request payload: N/A
+ :reply payload: N/A
- Issued when a new connection is established. It sets the current
- *master* as an owner of the session. This can be used on the *slave*
+ Issued when a new connection is established. It marks the sender
+ as the front-end that owns the session. This can be used on the *back-end*
as a "session start" flag.
``VHOST_USER_RESET_OWNER``
:id: 4
- :master payload: N/A
+ :request payload: N/A
+ :reply payload: N/A
.. admonition:: Deprecated
This is no longer used. Used to be sent to request disabling all
- rings, but some clients interpreted it to also discard connection
+ rings, but some back-ends interpreted it to also discard connection
state (this interpretation would lead to bugs). It is recommended
- that clients either ignore this message, or use it to disable all
+ that back-ends either ignore this message, or use it to disable all
rings.
``VHOST_USER_SET_MEM_TABLE``
:id: 5
:equivalent ioctl: ``VHOST_SET_MEM_TABLE``
- :master payload: memory regions description
- :slave payload: (postcopy only) memory regions description
+ :request payload: multiple memory regions description
+ :reply payload: (postcopy only) multiple memory regions description
- Sets the memory map regions on the slave so it can translate the
+ Sets the memory map regions on the back-end so it can translate the
vring addresses. In the ancillary data there is an array of file
descriptors for each memory mapped region. The size and ordering of
the fds matches the number and ordering of memory regions.
When ``VHOST_USER_POSTCOPY_LISTEN`` has been received,
``SET_MEM_TABLE`` replies with the bases of the memory mapped
- regions to the master. The slave must have mmap'd the regions but
+ regions to the front-end. The back-end must have mmap'd the regions but
not yet accessed them and should not yet generate a userfault
event.
@@ -941,12 +1160,12 @@ Master message types
``VHOST_USER_SET_LOG_BASE``
:id: 6
:equivalent ioctl: ``VHOST_SET_LOG_BASE``
- :master payload: u64
- :slave payload: N/A
+ :request payload: u64
+ :reply payload: N/A
Sets logging shared memory space.
- When slave has ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature,
+ When the back-end has ``VHOST_USER_PROTOCOL_F_LOG_SHMFD`` protocol feature,
the log memory fd is provided in the ancillary data of
 ``VHOST_USER_SET_LOG_BASE`` message; the size and offset of the shared
 memory area are provided in the message.
@@ -954,44 +1173,84 @@ Master message types
``VHOST_USER_SET_LOG_FD``
:id: 7
:equivalent ioctl: ``VHOST_SET_LOG_FD``
- :master payload: N/A
+ :request payload: N/A
+ :reply payload: N/A
Sets the logging file descriptor, which is passed as ancillary data.
``VHOST_USER_SET_VRING_NUM``
:id: 8
:equivalent ioctl: ``VHOST_SET_VRING_NUM``
- :master payload: vring state description
+ :request payload: vring state description
+ :reply payload: N/A
Set the size of the queue.
``VHOST_USER_SET_VRING_ADDR``
:id: 9
:equivalent ioctl: ``VHOST_SET_VRING_ADDR``
- :master payload: vring address description
- :slave payload: N/A
+ :request payload: vring address description
+ :reply payload: N/A
Sets the addresses of the different aspects of the vring.
``VHOST_USER_SET_VRING_BASE``
:id: 10
:equivalent ioctl: ``VHOST_SET_VRING_BASE``
- :master payload: vring state description
+ :request payload: vring descriptor index/indices
+ :reply payload: N/A
+
+ Sets the next index to use for descriptors in this vring:
- Sets the base offset in the available vring.
+ * For a split virtqueue, sets only the next descriptor index to
+ process in the *Available Ring*. The device is supposed to read the
+ next index in the *Used Ring* from the respective vring structure in
+ guest memory.
+
+ * For a packed virtqueue, both indices are supplied, as they are not
+ explicitly available in memory.
+
+ Consequently, the payload type is specific to the type of virt queue
+ (*a vring descriptor index for split virtqueues* vs. *vring descriptor
+ indices for packed virtqueues*).
``VHOST_USER_GET_VRING_BASE``
:id: 11
 :equivalent ioctl: ``VHOST_GET_VRING_BASE``
- :master payload: vring state description
- :slave payload: vring state description
+ :request payload: vring state description
+ :reply payload: vring descriptor index/indices
+
+ Stops the vring and returns the current descriptor index or indices:
+
+ * For a split virtqueue, returns only the 16-bit next descriptor
+ index to process in the *Available Ring*. Note that this may
+ differ from the available ring index in the vring structure in
+ memory, which points to where the driver will put new available
+ descriptors. For the *Used Ring*, the device only needs the next
+ descriptor index at which to put new descriptors, which is the
+ value in the vring structure in memory, so this value is not
+ covered by this message.
+
+ * For a packed virtqueue, neither index is explicitly available to
+ read from memory, so both indices (as maintained by the device) are
+ returned.
+
+ Consequently, the payload type is specific to the type of virt queue
+ (*a vring descriptor index for split virtqueues* vs. *vring descriptor
+ indices for packed virtqueues*).
- Get the available vring base offset.
+ When and as long as all of a device’s vrings are stopped, it is
+ *suspended*, see :ref:`Suspended device state
+ <suspended_device_state>`.
+
+ The request payload’s *num* field is currently reserved and must be
+ set to 0.
``VHOST_USER_SET_VRING_KICK``
:id: 12
:equivalent ioctl: ``VHOST_SET_VRING_KICK``
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
Set the event file descriptor for adding buffers to the vring. It is
passed in the ancillary data.
@@ -1009,7 +1268,8 @@ Master message types
``VHOST_USER_SET_VRING_CALL``
:id: 13
:equivalent ioctl: ``VHOST_SET_VRING_CALL``
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
Set the event file descriptor to signal when buffers are used. It is
passed in the ancillary data.
@@ -1019,15 +1279,16 @@ Master message types
in the ancillary data. This signals that polling will be used
instead of waiting for the call. Note that if the protocol features
``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
- ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message
- isn't necessary as the ``VHOST_USER_SLAVE_VRING_CALL`` message can be
+ ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` have been negotiated this message
+ isn't necessary as the ``VHOST_USER_BACKEND_VRING_CALL`` message can be
used, it may however still be used to set an event file descriptor
or to enable polling.
``VHOST_USER_SET_VRING_ERR``
:id: 14
:equivalent ioctl: ``VHOST_SET_VRING_ERR``
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
Set the event file descriptor to signal when error occurs. It is
passed in the ancillary data.
@@ -1036,18 +1297,18 @@ Master message types
invalid FD flag. This flag is set when there is no file descriptor
in the ancillary data. Note that if the protocol features
``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` and
- ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` have been negotiated this message
- isn't necessary as the ``VHOST_USER_SLAVE_VRING_ERR`` message can be
+ ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` have been negotiated this message
+ isn't necessary as the ``VHOST_USER_BACKEND_VRING_ERR`` message can be
used, it may however still be used to set an event file descriptor
(which will be preferred over the message).
``VHOST_USER_GET_QUEUE_NUM``
:id: 17
:equivalent ioctl: N/A
- :master payload: N/A
- :slave payload: u64
+ :request payload: N/A
+ :reply payload: u64
- Query how many queues the backend supports.
+ Query how many queues the back-end supports.
This request should be sent only when ``VHOST_USER_PROTOCOL_F_MQ``
is set in queried protocol features by
@@ -1056,9 +1317,10 @@ Master message types
``VHOST_USER_SET_VRING_ENABLE``
:id: 18
:equivalent ioctl: N/A
- :master payload: vring state description
+ :request payload: vring state description
+ :reply payload: N/A
- Signal slave to enable or disable corresponding vring.
+ Signal the back-end to enable or disable the corresponding vring.
This request should be sent only when
``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated.
@@ -1066,9 +1328,10 @@ Master message types
``VHOST_USER_SEND_RARP``
:id: 19
:equivalent ioctl: N/A
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
- Ask vhost user backend to broadcast a fake RARP to notify the migration
+ Ask the vhost-user back-end to broadcast a fake RARP to notify that the migration
is terminated for guest that does not support GUEST_ANNOUNCE.
Only legal if feature bit ``VHOST_USER_F_PROTOCOL_FEATURES`` is
@@ -1076,12 +1339,13 @@ Master message types
``VHOST_USER_PROTOCOL_F_RARP`` is present in
``VHOST_USER_GET_PROTOCOL_FEATURES``. The first 6 bytes of the
payload contain the mac address of the guest to allow the vhost user
- backend to construct and broadcast the fake RARP.
+ back-end to construct and broadcast the fake RARP.
``VHOST_USER_NET_SET_MTU``
:id: 20
:equivalent ioctl: N/A
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
Set host MTU value exposed to the guest.
@@ -1091,35 +1355,36 @@ Master message types
``VHOST_USER_PROTOCOL_F_NET_MTU`` is present in
``VHOST_USER_GET_PROTOCOL_FEATURES``.
- If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
+ If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the back-end must
respond with zero in case the specified MTU is valid, or non-zero
otherwise.
-``VHOST_USER_SET_SLAVE_REQ_FD``
+``VHOST_USER_SET_BACKEND_REQ_FD`` (previous name ``VHOST_USER_SET_SLAVE_REQ_FD``)
:id: 21
:equivalent ioctl: N/A
- :master payload: N/A
+ :request payload: N/A
+ :reply payload: N/A
- Set the socket file descriptor for slave initiated requests. It is passed
+ Set the socket file descriptor for back-end initiated requests. It is passed
in the ancillary data.
This request should be sent only when
``VHOST_USER_F_PROTOCOL_FEATURES`` has been negotiated, and protocol
- feature bit ``VHOST_USER_PROTOCOL_F_SLAVE_REQ`` bit is present in
+ feature bit ``VHOST_USER_PROTOCOL_F_BACKEND_REQ`` is present in
``VHOST_USER_GET_PROTOCOL_FEATURES``. If
- ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, slave must
+ ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, the back-end must
respond with zero for success, non-zero otherwise.
``VHOST_USER_IOTLB_MSG``
:id: 22
:equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
- :master payload: ``struct vhost_iotlb_msg``
- :slave payload: ``u64``
+ :request payload: ``struct vhost_iotlb_msg``
+ :reply payload: ``u64``
Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
- Master sends such requests to update and invalidate entries in the
- device IOTLB. The slave has to acknowledge the request with sending
+ The front-end sends such requests to update and invalidate entries in the
+ device IOTLB. The back-end has to acknowledge the request by sending
zero as ``u64`` payload for success, non-zero otherwise.
This request should be send only when ``VIRTIO_F_IOMMU_PLATFORM``
@@ -1128,7 +1393,8 @@ Master message types
``VHOST_USER_SET_VRING_ENDIAN``
:id: 23
:equivalent ioctl: ``VHOST_SET_VRING_ENDIAN``
- :master payload: vring state description
+ :request payload: vring state description
+ :reply payload: N/A
Set the endianness of a VQ for legacy devices. Little-endian is
indicated with state.num set to 0 and big-endian is indicated with
@@ -1138,42 +1404,42 @@ Master message types
``VHOST_USER_PROTOCOL_F_CROSS_ENDIAN`` has been negotiated.
Backends that negotiated this feature should handle both
endiannesses and expect this message once (per VQ) during device
- configuration (ie. before the master starts the VQ).
+ configuration (ie. before the front-end starts the VQ).
``VHOST_USER_GET_CONFIG``
:id: 24
:equivalent ioctl: N/A
- :master payload: virtio device config space
- :slave payload: virtio device config space
+ :request payload: virtio device config space
+ :reply payload: virtio device config space
When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
- submitted by the vhost-user master to fetch the contents of the
- virtio device configuration space, vhost-user slave's payload size
- MUST match master's request, vhost-user slave uses zero length of
- payload to indicate an error to vhost-user master. The vhost-user
- master may cache the contents to avoid repeated
+ submitted by the vhost-user front-end to fetch the contents of the
+ virtio device configuration space. The vhost-user back-end's payload size
+ MUST match the front-end's request; the back-end uses a zero-length
+ payload to indicate an error to the vhost-user front-end. The vhost-user
+ front-end may cache the contents to avoid repeated
``VHOST_USER_GET_CONFIG`` calls.
``VHOST_USER_SET_CONFIG``
:id: 25
:equivalent ioctl: N/A
- :master payload: virtio device config space
- :slave payload: N/A
+ :request payload: virtio device config space
+ :reply payload: N/A
When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
- submitted by the vhost-user master when the Guest changes the virtio
+ submitted by the vhost-user front-end when the Guest changes the virtio
device configuration space and also can be used for live migration
- on the destination host. The vhost-user slave must check the flags
- field, and slaves MUST NOT accept SET_CONFIG for read-only
+ on the destination host. The vhost-user back-end must check the flags
+ field, and back-ends MUST NOT accept SET_CONFIG for read-only
configuration space fields unless the live migration bit is set.
``VHOST_USER_CREATE_CRYPTO_SESSION``
:id: 26
:equivalent ioctl: N/A
- :master payload: crypto session description
- :slave payload: crypto session description
+ :request payload: crypto session description
+ :reply payload: crypto session description
- Create a session for crypto operation. The server side must return
+ Create a session for crypto operation. The back-end must return
the session id, 0 or positive for success, negative for failure.
This request should be sent only when
``VHOST_USER_PROTOCOL_F_CRYPTO_SESSION`` feature has been
@@ -1183,7 +1449,8 @@ Master message types
``VHOST_USER_CLOSE_CRYPTO_SESSION``
:id: 27
:equivalent ioctl: N/A
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
Close a session for crypto operation which was previously
created by ``VHOST_USER_CREATE_CRYPTO_SESSION``.
@@ -1195,20 +1462,21 @@ Master message types
``VHOST_USER_POSTCOPY_ADVISE``
:id: 28
- :master payload: N/A
- :slave payload: userfault fd
+ :request payload: N/A
+ :reply payload: userfault fd
- When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the master
- advises slave that a migration with postcopy enabled is underway,
- the slave must open a userfaultfd for later use. Note that at this
+ When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, the front-end
+ advises the back-end that a migration with postcopy enabled is underway;
+ the back-end must open a userfaultfd for later use. Note that at this
stage the migration is still in precopy mode.
``VHOST_USER_POSTCOPY_LISTEN``
:id: 29
- :master payload: N/A
+ :request payload: N/A
+ :reply payload: N/A
- Master advises slave that a transition to postcopy mode has
- happened. The slave must ensure that shared memory is registered
+ The front-end advises the back-end that a transition to postcopy mode has
+ happened. The back-end must ensure that shared memory is registered
with userfaultfd to cause faulting of non-present pages.
This is always sent sometime after a ``VHOST_USER_POSTCOPY_ADVISE``,
@@ -1216,10 +1484,11 @@ Master message types
``VHOST_USER_POSTCOPY_END``
:id: 30
- :slave payload: ``u64``
+ :request payload: N/A
+ :reply payload: ``u64``
- Master advises that postcopy migration has now completed. The slave
- must disable the userfaultfd. The response is an acknowledgement
+ The front-end advises that postcopy migration has now completed. The back-end
+ must disable the userfaultfd. The reply is an acknowledgement
only.
When ``VHOST_USER_PROTOCOL_F_PAGEFAULT`` is supported, this message
@@ -1231,165 +1500,272 @@ Master message types
``VHOST_USER_GET_INFLIGHT_FD``
:id: 31
:equivalent ioctl: N/A
- :master payload: inflight description
+ :request payload: inflight description
+ :reply payload: N/A
When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
- been successfully negotiated, this message is submitted by master to
- get a shared buffer from slave. The shared buffer will be used to
- track inflight I/O by slave. QEMU should retrieve a new one when vm
+ been successfully negotiated, this message is submitted by the front-end to
+ get a shared buffer from the back-end. The shared buffer will be used to
+ track inflight I/O by the back-end. QEMU should retrieve a new one on VM
reset.
``VHOST_USER_SET_INFLIGHT_FD``
:id: 32
:equivalent ioctl: N/A
- :master payload: inflight description
+ :request payload: inflight description
+ :reply payload: N/A
When ``VHOST_USER_PROTOCOL_F_INFLIGHT_SHMFD`` protocol feature has
- been successfully negotiated, this message is submitted by master to
- send the shared inflight buffer back to slave so that slave could
- get inflight I/O after a crash or restart.
+ been successfully negotiated, this message is submitted by the front-end to
+ send the shared inflight buffer back to the back-end so that the back-end
+ could get inflight I/O after a crash or restart.
``VHOST_USER_GPU_SET_SOCKET``
:id: 33
:equivalent ioctl: N/A
- :master payload: N/A
+ :request payload: N/A
+ :reply payload: N/A
Sets the GPU protocol socket file descriptor, which is passed as
- ancillary data. The GPU protocol is used to inform the master of
+ ancillary data. The GPU protocol is used to inform the front-end of
rendering state and updates. See vhost-user-gpu.rst for details.
``VHOST_USER_RESET_DEVICE``
:id: 34
:equivalent ioctl: N/A
- :master payload: N/A
- :slave payload: N/A
+ :request payload: N/A
+ :reply payload: N/A
- Ask the vhost user backend to disable all rings and reset all
+ Ask the vhost user back-end to disable all rings and reset all
internal device state to the initial state, ready to be
- reinitialized. The backend retains ownership of the device
+ reinitialized. The back-end retains ownership of the device
throughout the reset operation.
Only valid if the ``VHOST_USER_PROTOCOL_F_RESET_DEVICE`` protocol
- feature is set by the backend.
+ feature is set by the back-end.
``VHOST_USER_VRING_KICK``
:id: 35
:equivalent ioctl: N/A
- :slave payload: vring state description
- :master payload: N/A
+ :request payload: vring state description
+ :reply payload: N/A
When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
feature has been successfully negotiated, this message may be
- submitted by the master to indicate that a buffer was added to
+ submitted by the front-end to indicate that a buffer was added to
the vring instead of signalling it using the vring's kick file
- descriptor or having the slave rely on polling.
+ descriptor or having the back-end rely on polling.
The state.num field is currently reserved and must be set to 0.
``VHOST_USER_GET_MAX_MEM_SLOTS``
:id: 36
:equivalent ioctl: N/A
- :slave payload: u64
+ :request payload: N/A
+ :reply payload: u64
When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
feature has been successfully negotiated, this message is submitted
- by master to the slave. The slave should return the message with a
+ by the front-end to the back-end. The back-end should return the message with a
u64 payload containing the maximum number of memory slots for
- QEMU to expose to the guest. The value returned by the backend
+ QEMU to expose to the guest. The value returned by the back-end
will be capped at the maximum number of ram slots which can be
supported by the target platform.
``VHOST_USER_ADD_MEM_REG``
:id: 37
:equivalent ioctl: N/A
- :slave payload: single memory region description
+ :request payload: single memory region description
+ :reply payload: N/A, except in postcopy mode (see below)
When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
feature has been successfully negotiated, this message is submitted
- by the master to the slave. The message payload contains a memory
+ by the front-end to the back-end. The message payload contains a memory
region descriptor struct, describing a region of guest memory which
- the slave device must map in. When the
+ the back-end device must map in. When the
``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
been successfully negotiated, along with the
``VHOST_USER_REM_MEM_REG`` message, this message is used to set and
- update the memory tables of the slave device.
+ update the memory tables of the back-end device.
+
+ Exactly one file descriptor from which the memory is mapped is
+ passed in the ancillary data.
+
+ In postcopy mode (see ``VHOST_USER_POSTCOPY_LISTEN``), the back-end
+ replies with the bases of the memory mapped region to the front-end.
+ For further details on postcopy, see ``VHOST_USER_SET_MEM_TABLE``.
+ They apply to ``VHOST_USER_ADD_MEM_REG`` accordingly.
``VHOST_USER_REM_MEM_REG``
:id: 38
:equivalent ioctl: N/A
- :slave payload: single memory region description
+ :request payload: single memory region description
+ :reply payload: N/A
When the ``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol
feature has been successfully negotiated, this message is submitted
- by the master to the slave. The message payload contains a memory
+ by the front-end to the back-end. The message payload contains a memory
region descriptor struct, describing a region of guest memory which
- the slave device must unmap. When the
+ the back-end device must unmap. When the
``VHOST_USER_PROTOCOL_F_CONFIGURE_MEM_SLOTS`` protocol feature has
been successfully negotiated, along with the
``VHOST_USER_ADD_MEM_REG`` message, this message is used to set and
- update the memory tables of the slave device.
+ update the memory tables of the back-end device.
+
+ The memory region to be removed is identified by its guest address,
+ user address and size. The mmap offset is ignored.
+
+ No file descriptors SHOULD be passed in the ancillary data. For
+ compatibility with existing incorrect implementations, the back-end MAY
+ accept messages with one file descriptor. If a file descriptor is
+ passed, the back-end MUST close it without using it otherwise.
``VHOST_USER_SET_STATUS``
:id: 39
:equivalent ioctl: VHOST_VDPA_SET_STATUS
- :slave payload: N/A
- :master payload: ``u64``
+ :request payload: ``u64``
+ :reply payload: N/A
When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
- successfully negotiated, this message is submitted by the master to
- notify the backend with updated device status as defined in the Virtio
+ successfully negotiated, this message is submitted by the front-end to
+ notify the back-end with updated device status as defined in the Virtio
specification.
``VHOST_USER_GET_STATUS``
:id: 40
:equivalent ioctl: VHOST_VDPA_GET_STATUS
- :slave payload: ``u64``
- :master payload: N/A
+ :request payload: N/A
+ :reply payload: ``u64``
When the ``VHOST_USER_PROTOCOL_F_STATUS`` protocol feature has been
- successfully negotiated, this message is submitted by the master to
- query the backend for its device status as defined in the Virtio
+ successfully negotiated, this message is submitted by the front-end to
+ query the back-end for its device status as defined in the Virtio
specification.
+``VHOST_USER_GET_SHARED_OBJECT``
+ :id: 41
+ :equivalent ioctl: N/A
+ :request payload: ``struct VhostUserShared``
+ :reply payload: dmabuf fd
+
+ When the ``VHOST_USER_PROTOCOL_F_SHARED_OBJECT`` protocol
+ feature has been successfully negotiated, and the UUID is found
+ in the exporters cache, this message is submitted by the front-end
+ to retrieve a given dma-buf fd from a given back-end, determined by
+ the requested UUID. The back-end will reply, passing the fd when the operation
+ is successful, or no fd otherwise.
+
+``VHOST_USER_SET_DEVICE_STATE_FD``
+ :id: 42
+ :equivalent ioctl: N/A
+ :request payload: device state transfer parameters
+ :reply payload: ``u64``
+
+ Front-end and back-end negotiate a channel over which to transfer the
+ back-end’s internal state during migration. Either side (front-end or
+ back-end) may create the channel. The nature of this channel is not
+ restricted or defined in this document, but whichever side creates it
+ must create a file descriptor that is provided to the
+ other side, allowing access to the channel. This FD must behave as
+ follows:
+
+ * For the writing end, it must allow writing the whole back-end state
+ sequentially. Closing the file descriptor signals the end of
+ transfer.
+
+ * For the reading end, it must allow reading the whole back-end state
+ sequentially. The end of file signals the end of the transfer.
+
+ For example, the channel may be a pipe, in which case the two ends of
+ the pipe fulfill these requirements respectively.
+
+ Initially, the front-end creates a channel along with such an FD. It
+ passes the FD to the back-end as ancillary data of a
+ ``VHOST_USER_SET_DEVICE_STATE_FD`` message. The back-end may create a
+ different transfer channel, passing the respective FD back to the
+ front-end as ancillary data of the reply. If so, the front-end must
+ then discard its channel and use the one provided by the back-end.
+
+ Whether the back-end should use its own channel depends on
+ efficiency: If the channel is a pipe, both ends will most
+ likely need to copy data into and out of it. Any channel that allows
+ for more efficient processing on at least one end, e.g. through
+ zero-copy, is considered more efficient and thus preferred. If the
+ back-end can provide such a channel, it should use it.
+
+ The request payload contains parameters for the subsequent data
+ transfer, as described in the :ref:`Migrating back-end state
+ <migrating_backend_state>` section.
+
+ The value returned is both an indication for success, and whether a
+ file descriptor for a back-end-provided channel is returned: Bits 0–7
+ are 0 on success, and non-zero on error. Bit 8 is the invalid FD
+ flag; this flag is set when there is no file descriptor returned.
+ When this flag is not set, the front-end must use the returned file
+ descriptor as its end of the transfer channel. The back-end must not
+ both indicate an error and return a file descriptor.
+
+ Using this function requires prior negotiation of the
+ ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
+
+``VHOST_USER_CHECK_DEVICE_STATE``
+ :id: 43
+ :equivalent ioctl: N/A
+ :request payload: N/A
+ :reply payload: ``u64``
+
+ After transferring the back-end’s internal state during migration (see
+ the :ref:`Migrating back-end state <migrating_backend_state>`
+ section), check whether the back-end was able to process the state
+ fully and successfully.
+
+ The value returned indicates success or error; 0 is success, any
+ non-zero value is an error.
+
+ Using this function requires prior negotiation of the
+ ``VHOST_USER_PROTOCOL_F_DEVICE_STATE`` feature.
-Slave message types
--------------------
+Back-end message types
+----------------------
+
+For this type of message, the request is sent by the back-end and the reply
+is sent by the front-end.
-``VHOST_USER_SLAVE_IOTLB_MSG``
+``VHOST_USER_BACKEND_IOTLB_MSG`` (previous name ``VHOST_USER_SLAVE_IOTLB_MSG``)
:id: 1
:equivalent ioctl: N/A (equivalent to ``VHOST_IOTLB_MSG`` message type)
- :slave payload: ``struct vhost_iotlb_msg``
- :master payload: N/A
+ :request payload: ``struct vhost_iotlb_msg``
+ :reply payload: N/A
Send IOTLB messages with ``struct vhost_iotlb_msg`` as payload.
- Slave sends such requests to notify of an IOTLB miss, or an IOTLB
+ The back-end sends such requests to notify of an IOTLB miss, or an IOTLB
access failure. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is
- negotiated, and slave set the ``VHOST_USER_NEED_REPLY`` flag, master
+ negotiated, and the back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the front-end
must respond with zero when operation is successfully completed, or
non-zero otherwise. This request should be send only when
``VIRTIO_F_IOMMU_PLATFORM`` feature has been successfully
negotiated.
-``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``
+``VHOST_USER_BACKEND_CONFIG_CHANGE_MSG`` (previous name ``VHOST_USER_SLAVE_CONFIG_CHANGE_MSG``)
:id: 2
:equivalent ioctl: N/A
- :slave payload: N/A
- :master payload: N/A
+ :request payload: N/A
+ :reply payload: N/A
When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, vhost-user
- slave sends such messages to notify that the virtio device's
+ back-end sends such messages to notify that the virtio device's
configuration space has changed, for those host devices which can
support such feature, host driver can send ``VHOST_USER_GET_CONFIG``
- message to slave to get the latest content. If
- ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and slave set the
- ``VHOST_USER_NEED_REPLY`` flag, master must respond with zero when
+ message to the back-end to get the latest content. If
+ ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the back-end sets the
+ ``VHOST_USER_NEED_REPLY`` flag, the front-end must respond with zero when
operation is successfully completed, or non-zero otherwise.
-``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``
+``VHOST_USER_BACKEND_VRING_HOST_NOTIFIER_MSG`` (previous name ``VHOST_USER_SLAVE_VRING_HOST_NOTIFIER_MSG``)
:id: 3
:equivalent ioctl: N/A
- :slave payload: vring area description
- :master payload: N/A
+ :request payload: vring area description
+ :reply payload: N/A
Sets host notifier for a specified queue. The queue index is
contained in the ``u64`` field of the vring area description. The
@@ -1400,7 +1776,7 @@ Slave message types
description. QEMU can mmap the file descriptor based on the size and
offset to get a memory range. Registering a host notifier means
mapping this memory range to the VM as the specified queue's notify
- MMIO region. Slave sends this request to tell QEMU to de-register
+ MMIO region. The back-end sends this request to tell QEMU to de-register
the existing notifier if any and register the new notifier if the
request is sent with a file descriptor.
@@ -1408,34 +1784,81 @@ Slave message types
``VHOST_USER_PROTOCOL_F_HOST_NOTIFIER`` protocol feature has been
successfully negotiated.
-``VHOST_USER_SLAVE_VRING_CALL``
+``VHOST_USER_BACKEND_VRING_CALL`` (previous name ``VHOST_USER_SLAVE_VRING_CALL``)
:id: 4
:equivalent ioctl: N/A
- :slave payload: vring state description
- :master payload: N/A
+ :request payload: vring state description
+ :reply payload: N/A
When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
feature has been successfully negotiated, this message may be
- submitted by the slave to indicate that a buffer was used from
+ submitted by the back-end to indicate that a buffer was used from
the vring instead of signalling this using the vring's call file
- descriptor or having the master relying on polling.
+ descriptor or having the front-end rely on polling.
The state.num field is currently reserved and must be set to 0.
-``VHOST_USER_SLAVE_VRING_ERR``
+``VHOST_USER_BACKEND_VRING_ERR`` (previous name ``VHOST_USER_SLAVE_VRING_ERR``)
:id: 5
:equivalent ioctl: N/A
- :slave payload: vring state description
- :master payload: N/A
+ :request payload: vring state description
+ :reply payload: N/A
When the ``VHOST_USER_PROTOCOL_F_INBAND_NOTIFICATIONS`` protocol
feature has been successfully negotiated, this message may be
- submitted by the slave to indicate that an error occurred on the
+ submitted by the back-end to indicate that an error occurred on the
specific vring, instead of signalling the error file descriptor
- set by the master via ``VHOST_USER_SET_VRING_ERR``.
+ set by the front-end via ``VHOST_USER_SET_VRING_ERR``.
The state.num field is currently reserved and must be set to 0.
+``VHOST_USER_BACKEND_SHARED_OBJECT_ADD``
+ :id: 6
+ :equivalent ioctl: N/A
+ :request payload: ``struct VhostUserShared``
+ :reply payload: N/A
+
+ When the ``VHOST_USER_PROTOCOL_F_SHARED_OBJECT`` protocol
+ feature has been successfully negotiated, this message can be submitted
+ by back-ends to add themselves as exporters to the virtio shared lookup
+ table. The back-end device gets associated with a UUID in the shared table.
+ The back-end is responsible for keeping its own table with exported dma-buf fds.
+ When another back-end tries to import the resource associated with the UUID,
+ it will send a message to the front-end, which will act as a proxy to the
+ exporter back-end. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and
+ the back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the front-end must
+ respond with zero when operation is successfully completed, or non-zero
+ otherwise.
+
+``VHOST_USER_BACKEND_SHARED_OBJECT_REMOVE``
+ :id: 7
+ :equivalent ioctl: N/A
+ :request payload: ``struct VhostUserShared``
+ :reply payload: N/A
+
+ When the ``VHOST_USER_PROTOCOL_F_SHARED_OBJECT`` protocol
+ feature has been successfully negotiated, this message can be submitted
+ by the back-end to remove itself from the virtio-dmabuf shared
+ table API. Only the back-end owning the entry (i.e., the one that first added
+ it) will have permission to remove it. Otherwise, the message is ignored.
+ The shared table will remove the back-end device associated with
+ the UUID. If ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` is negotiated, and the
+ back-end sets the ``VHOST_USER_NEED_REPLY`` flag, the front-end must respond
+ with zero when operation is successfully completed, or non-zero otherwise.
+
+``VHOST_USER_BACKEND_SHARED_OBJECT_LOOKUP``
+ :id: 8
+ :equivalent ioctl: N/A
+ :request payload: ``struct VhostUserShared``
+ :reply payload: dmabuf fd and ``u64``
+
+ When the ``VHOST_USER_PROTOCOL_F_SHARED_OBJECT`` protocol
+ feature has been successfully negotiated, this message can be submitted
+ by back-ends to retrieve a given dma-buf fd from the virtio-dmabuf
+ shared table given a UUID. The front-end will reply, passing the fd and a zero
+ when the operation is successful, or non-zero otherwise. Note that if the
+ operation fails, no fd is sent to the back-end.
+
.. _reply_ack:
VHOST_USER_PROTOCOL_F_REPLY_ACK
@@ -1443,21 +1866,21 @@ VHOST_USER_PROTOCOL_F_REPLY_ACK
The original vhost-user specification only demands replies for certain
commands. This differs from the vhost protocol implementation where
-commands are sent over an ``ioctl()`` call and block until the client
+commands are sent over an ``ioctl()`` call and block until the back-end
has completed.
With this protocol extension negotiated, the sender (QEMU) can set the
``need_reply`` [Bit 3] flag to any command. This indicates that the
-client MUST respond with a Payload ``VhostUserMsg`` indicating success
+back-end MUST respond with a Payload ``VhostUserMsg`` indicating success
or failure. The payload should be set to zero on success or non-zero
on failure, unless the message already has an explicit reply body.
-The response payload gives QEMU a deterministic indication of the result
+The reply payload gives QEMU a deterministic indication of the result
of the command. Today, QEMU is expected to terminate the main vhost-user
loop upon receiving such errors. In future, qemu could be taught to be more
resilient for selective requests.
-For the message types that already solicit a reply from the client,
+For the message types that already solicit a reply from the back-end,
the presence of ``VHOST_USER_PROTOCOL_F_REPLY_ACK`` or need_reply bit
being set brings no behavioural change. (See the Communication_
section for details.)
@@ -1467,26 +1890,26 @@ section for details.)
Backend program conventions
===========================
-vhost-user backends can provide various devices & services and may
+vhost-user back-ends can provide various devices & services and may
need to be configured manually depending on the use case. However, it
is a good idea to follow the conventions listed here when
possible. Users, QEMU or libvirt, can then rely on some common
behaviour to avoid heterogeneous configuration and management of the
-backend programs and facilitate interoperability.
+back-end programs and facilitate interoperability.
-Each backend installed on a host system should come with at least one
+Each back-end installed on a host system should come with at least one
JSON file that conforms to the vhost-user.json schema. Each file
-informs the management applications about the backend type, and binary
+informs the management applications about the back-end type, and binary
location. In addition, it defines rules for management apps for
-picking the highest priority backend when multiple match the search
+picking the highest priority back-end when multiple match the search
criteria (see ``@VhostUserBackend`` documentation in the schema file).
-If the backend is not capable of enabling a requested feature on the
+If the back-end is not capable of enabling a requested feature on the
host (such as 3D acceleration with virgl), or the initialization
-failed, the backend should fail to start early and exit with a status
+failed, the back-end should fail to start early and exit with a status
!= 0. It may also print a message to stderr for further details.
-The backend program must not daemonize itself, but it may be
+The back-end program must not daemonize itself, but it may be
daemonized by the management layer. It may also have a restricted
access to the system.
@@ -1494,7 +1917,7 @@ File descriptors 0, 1 and 2 will exist, and have regular
stdin/stdout/stderr usage (they may have been redirected to /dev/null
by the management layer, or to a log handler).
-The backend program must end (as quickly and cleanly as possible) when
+The back-end program must end (as quickly and cleanly as possible) when
the SIGTERM signal is received. Eventually, it may receive SIGKILL by
the management layer after a few seconds.
@@ -1508,15 +1931,15 @@ are mandatory, unless explicitly said differently:
--fd=FDNUM
- When this argument is given, the backend program is started with the
+ When this argument is given, the back-end program is started with the
vhost-user socket as file descriptor FDNUM. It is incompatible with
--socket-path.
--print-capabilities
- Output to stdout the backend capabilities in JSON format, and then
+ Output to stdout the back-end capabilities in JSON format, and then
exit successfully. Other options and arguments should be ignored, and
- the backend program should not perform its normal function. The
+ the back-end program should not perform its normal function. The
capabilities can be reported dynamically depending on the host
capabilities.
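
Editorial illustration, not part of the patch: the back-end program conventions above can be sketched as a small command-line skeleton. This is a minimal sketch under stated assumptions. The program name `vhost-user-example`, the helper names, and the capabilities dictionary are hypothetical; the authoritative shape of the capabilities output is whatever the vhost-user.json schema defines. Only `--fd`, `--socket-path` and `--print-capabilities` come from the text.

```python
import argparse
import json
import signal
import sys


def build_parser():
    """Command-line options named in the conventions above."""
    parser = argparse.ArgumentParser(prog="vhost-user-example")  # hypothetical name
    # --fd is documented as incompatible with --socket-path.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--socket-path", metavar="PATH")
    group.add_argument("--fd", type=int, metavar="FDNUM")
    parser.add_argument("--print-capabilities", action="store_true")
    return parser


def capabilities():
    # Placeholder content (assumption): the real keys and values are
    # defined by the vhost-user.json schema, not by this sketch.
    return {"type": "input", "features": []}


def main(argv):
    # --print-capabilities: other options are ignored, so check for it
    # before normal parsing can reject the rest of the command line.
    if "--print-capabilities" in argv:
        json.dump(capabilities(), sys.stdout)
        sys.stdout.write("\n")
        return 0

    args = build_parser().parse_args(argv)
    if args.fd is None and args.socket_path is None:
        # Fail early with a non-zero status and a message on stderr.
        print("either --fd or --socket-path is required", file=sys.stderr)
        return 1

    # End quickly and cleanly on SIGTERM; the program does not daemonize.
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))

    # Device setup and the vhost-user message loop would go here.
    return 0
```

A real program would finish with `sys.exit(main(sys.argv[1:]))`. Checking for `--print-capabilities` before argument parsing is one way to honour the rule that other options are ignored in that mode; a management layer could then run the binary with only that flag and parse the JSON it prints on stdout.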
diff --git a/docs/virtio-balloon-stats.txt b/docs/interop/virtio-balloon-stats.rst
index 1732cc8c8a..b9a6a6edb2 100644
--- a/docs/virtio-balloon-stats.txt
+++ b/docs/interop/virtio-balloon-stats.rst
@@ -1,4 +1,4 @@
-virtio balloon memory statistics
+Virtio balloon memory statistics
================================
The virtio balloon driver supports guest memory statistics reporting. These
@@ -9,10 +9,12 @@ Before querying the available stats, clients first have to enable polling.
This is done by writing a time interval value (in seconds) to the
guest-stats-polling-interval property. This value can be:
- > 0 enables polling in the specified interval. If polling is already
+ > 0
+ enables polling in the specified interval. If polling is already
enabled, the polling time interval is changed to the new value
- 0 disables polling. Previous polled statistics are still valid and
+ 0
+ disables polling. Previous polled statistics are still valid and
can be queried.
Once polling is enabled, the virtio-balloon device in QEMU will start
@@ -22,7 +24,7 @@ interval.
To retrieve those stats, clients have to query the guest-stats property,
which will return a dictionary containing:
- o A key named 'stats', containing all available stats. If the guest
+ * A key named 'stats', containing all available stats. If the guest
doesn't support a particular stat, or if it couldn't be retrieved,
its value will be -1. Currently, the following stats are supported:
@@ -37,7 +39,7 @@ which will return a dictionary containing:
- stat-htlb-pgalloc
- stat-htlb-pgfail
- o A key named last-update, which contains the last stats update
+ * A key named last-update, which contains the last stats update
timestamp in seconds. Since this timestamp is generated by the host,
a buggy guest can't influence its value. The value is 0 if the guest
has not updated the stats (yet).
@@ -61,32 +63,32 @@ It's also important to note the following:
respond to the request the timer will never be re-armed, which has
the same effect as disabling polling
-Here are a few examples. QEMU is started with '-device virtio-balloon',
-which generates '/machine/peripheral-anon/device[1]' as the QOM path for
+Here are a few examples. QEMU is started with ``-device virtio-balloon``,
+which generates ``/machine/peripheral-anon/device[1]`` as the QOM path for
the balloon device.
-Enable polling with 2 seconds interval:
+Enable polling with a 2-second interval::
-{ "execute": "qom-set",
- "arguments": { "path": "/machine/peripheral-anon/device[1]",
- "property": "guest-stats-polling-interval", "value": 2 } }
+ { "execute": "qom-set",
+ "arguments": { "path": "/machine/peripheral-anon/device[1]",
+ "property": "guest-stats-polling-interval", "value": 2 } }
-{ "return": {} }
+ { "return": {} }
-Change polling to 10 seconds:
+Change polling to 10 seconds::
-{ "execute": "qom-set",
- "arguments": { "path": "/machine/peripheral-anon/device[1]",
- "property": "guest-stats-polling-interval", "value": 10 } }
+ { "execute": "qom-set",
+ "arguments": { "path": "/machine/peripheral-anon/device[1]",
+ "property": "guest-stats-polling-interval", "value": 10 } }
-{ "return": {} }
+ { "return": {} }
-Get stats:
+Get stats::
-{ "execute": "qom-get",
- "arguments": { "path": "/machine/peripheral-anon/device[1]",
- "property": "guest-stats" } }
-{
+ { "execute": "qom-get",
+ "arguments": { "path": "/machine/peripheral-anon/device[1]",
+ "property": "guest-stats" } }
+ {
"return": {
"stats": {
"stat-swap-out": 0,
@@ -98,12 +100,12 @@ Get stats:
},
"last-update": 1358529861
}
-}
+ }
-Disable polling:
+Disable polling::
-{ "execute": "qom-set",
- "arguments": { "path": "/machine/peripheral-anon/device[1]",
- "property": "stats-polling-interval", "value": 0 } }
+ { "execute": "qom-set",
+ "arguments": { "path": "/machine/peripheral-anon/device[1]",
+ "property": "guest-stats-polling-interval", "value": 0 } }
-{ "return": {} }
+ { "return": {} }
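The QMP exchanges in the virtio-balloon examples above can be generated programmatically. The following is a minimal sketch, not part of the patch: the helper names are hypothetical, and only the `qom-set`/`qom-get` commands, the QOM path, and the property names are taken from the text. It assumes the polling property is `guest-stats-polling-interval`, as the prose describes, and it treats a stat value of -1 as "unsupported or not retrievable".

```python
import json

# QOM path used in the examples above; a real setup may use a different path.
BALLOON_PATH = "/machine/peripheral-anon/device[1]"


def set_polling_interval(seconds, path=BALLOON_PATH):
    """Build a qom-set command: > 0 enables polling, 0 disables it."""
    if seconds < 0:
        raise ValueError("polling interval must be >= 0")
    return {
        "execute": "qom-set",
        "arguments": {
            "path": path,
            "property": "guest-stats-polling-interval",
            "value": seconds,
        },
    }


def get_stats(path=BALLOON_PATH):
    """Build the qom-get command that retrieves the guest-stats dictionary."""
    return {
        "execute": "qom-get",
        "arguments": {"path": path, "property": "guest-stats"},
    }


def supported_stats(reply):
    """Drop stats reported as -1 (unsupported or not retrievable)."""
    stats = reply["return"]["stats"]
    return {name: value for name, value in stats.items() if value != -1}


if __name__ == "__main__":
    # Print the commands as they would be sent over a QMP connection.
    print(json.dumps(set_polling_interval(2)))
    print(json.dumps(get_stats()))
    print(json.dumps(set_polling_interval(0)))
```

Sending the JSON over an actual QMP socket, including the capabilities handshake, is deliberately left out of the sketch.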
diff --git a/docs/interop/vnc-ledstate-Pseudo-encoding.txt b/docs/interop/vnc-ledstate-pseudo-encoding.rst
index 0f124f68b1..0f124f68b1 100644
--- a/docs/interop/vnc-ledstate-Pseudo-encoding.txt
+++ b/docs/interop/vnc-ledstate-pseudo-encoding.rst
diff --git a/docs/meson.build b/docs/meson.build
index cffe1ecf1d..9040f860ae 100644
--- a/docs/meson.build
+++ b/docs/meson.build
@@ -1,10 +1,5 @@
-if get_option('sphinx_build') == ''
- sphinx_build = find_program(['sphinx-build-3', 'sphinx-build'],
- required: get_option('docs'))
-else
- sphinx_build = find_program(get_option('sphinx_build'),
- required: get_option('docs'))
-endif
+sphinx_build = find_program(fs.parent(python.full_path()) / 'sphinx-build',
+ required: get_option('docs'))
# Check if tools are available to build documentation.
build_docs = false
@@ -12,18 +7,30 @@ if sphinx_build.found()
SPHINX_ARGS = ['env', 'CONFDIR=' + qemu_confdir, sphinx_build, '-q']
# If we're making warnings fatal, apply this to Sphinx runs as well
if get_option('werror')
- SPHINX_ARGS += [ '-W' ]
+ SPHINX_ARGS += [ '-W', '-Dkerneldoc_werror=1' ]
+ endif
+
+ sphinx_version = run_command(SPHINX_ARGS + ['--version'],
+ check: true).stdout().split()[1]
+ if sphinx_version.version_compare('>=1.7.0')
+ SPHINX_ARGS += ['-j', 'auto']
+ else
+ nproc = find_program('nproc')
+ if nproc.found()
+ jobs = run_command(nproc, check: true).stdout()
+ SPHINX_ARGS += ['-j', jobs]
+ endif
endif
# This is a bit awkward but works: create a trivial document and
# try to run it with our configuration file (which enforces a
# version requirement). This will fail if sphinx-build is too old.
- run_command('mkdir', ['-p', tmpdir / 'sphinx'])
- run_command('touch', [tmpdir / 'sphinx/index.rst'])
+ run_command('mkdir', ['-p', tmpdir / 'sphinx'], check: true)
+ run_command('touch', [tmpdir / 'sphinx/index.rst'], check: true)
sphinx_build_test_out = run_command(SPHINX_ARGS + [
'-c', meson.current_source_dir(),
'-b', 'html', tmpdir / 'sphinx',
- tmpdir / 'sphinx/out'])
+ tmpdir / 'sphinx/out'], check: false)
build_docs = (sphinx_build_test_out.returncode() == 0)
if not build_docs
@@ -35,18 +42,7 @@ if sphinx_build.found()
endif
if build_docs
- SPHINX_ARGS += ['-Dversion=' + meson.project_version(), '-Drelease=' + config_host['PKGVERSION']]
-
- sphinx_extn_depends = [ meson.source_root() / 'docs/sphinx/depfile.py',
- meson.source_root() / 'docs/sphinx/hxtool.py',
- meson.source_root() / 'docs/sphinx/kerneldoc.py',
- meson.source_root() / 'docs/sphinx/kernellog.py',
- meson.source_root() / 'docs/sphinx/qapidoc.py',
- meson.source_root() / 'docs/sphinx/qmp_lexer.py',
- qapi_gen_depends ]
- sphinx_template_files = [ meson.source_root() / 'docs/_templates/footer.html' ]
-
- have_ga = have_tools and config_host.has_key('CONFIG_GUEST_AGENT')
+ SPHINX_ARGS += ['-Dversion=' + meson.project_version(), '-Drelease=' + get_option('pkgversion')]
man_pages = {
'qemu-ga.8': (have_ga ? 'man8' : ''),
@@ -57,9 +53,8 @@ if build_docs
'qemu-nbd.8': (have_tools ? 'man8' : ''),
'qemu-pr-helper.8': (have_tools ? 'man8' : ''),
'qemu-storage-daemon.1': (have_tools ? 'man1' : ''),
- 'qemu-trace-stap.1': (config_host.has_key('CONFIG_TRACE_SYSTEMTAP') ? 'man1' : ''),
+ 'qemu-trace-stap.1': (stap.found() ? 'man1' : ''),
'virtfs-proxy-helper.1': (have_virtfs_proxy_helper ? 'man1' : ''),
- 'virtiofsd.1': (have_virtiofsd ? 'man1' : ''),
'qemu.1': 'man1',
'qemu-block-drivers.7': 'man7',
'qemu-cpu-models.7': 'man7'
@@ -77,7 +72,6 @@ if build_docs
output: 'docs.stamp',
input: files('conf.py'),
depfile: 'docs.d',
- depend_files: [ sphinx_extn_depends, sphinx_template_files ],
command: [SPHINX_ARGS, '-Ddepfile=@DEPFILE@',
'-Ddepfile_stamp=@OUTPUT0@',
'-b', 'html', '-d', private_dir,
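The job-count selection added to docs/meson.build above can be restated outside Meson. This is a rough sketch under stated assumptions: the function names are hypothetical, the version is taken as the second word of the `sphinx-build --version` output (as in the hunk), and the tuple comparison only approximates Meson's `version_compare` for purely numeric versions.

```python
def parse_version(text):
    """Turn '1.7.0' into (1, 7, 0); non-numeric suffixes are not handled."""
    return tuple(int(part) for part in text.split("."))


def sphinx_job_args(version_output, nproc_output=None):
    """Return the extra Sphinx arguments that select the parallel job count.

    version_output: text printed by `sphinx-build --version`.
    nproc_output:   text printed by `nproc`, or None if nproc is unavailable.
    """
    version = version_output.split()[1]
    if parse_version(version) >= (1, 7, 0):
        # Newer Sphinx can pick the job count itself.
        return ["-j", "auto"]
    if nproc_output is not None:
        # Older Sphinx needs an explicit number; strip the trailing newline.
        return ["-j", nproc_output.strip()]
    return []
```

Unlike the Meson code, the sketch strips whitespace from the `nproc` output before passing it on.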
diff --git a/docs/multi-thread-compression.txt b/docs/multi-thread-compression.txt
index bb88c6bdf1..95b1556f67 100644
--- a/docs/multi-thread-compression.txt
+++ b/docs/multi-thread-compression.txt
@@ -117,13 +117,13 @@ to support the multiple thread compression migration:
{qemu} migrate_set_capability compress on
3. Set the compression thread count on source:
- {qemu} migrate_set_parameter compress_threads 12
+ {qemu} migrate_set_parameter compress-threads 12
4. Set the compression level on the source:
- {qemu} migrate_set_parameter compress_level 1
+ {qemu} migrate_set_parameter compress-level 1
5. Set the decompression thread count on destination:
- {qemu} migrate_set_parameter decompress_threads 3
+ {qemu} migrate_set_parameter decompress-threads 3
6. Start outgoing migration:
{qemu} migrate -d tcp:destination.host:4444
@@ -133,9 +133,9 @@ to support the multiple thread compression migration:
The following are the default settings:
compress: off
- compress_threads: 8
- decompress_threads: 2
- compress_level: 1 (which means best speed)
+ compress-threads: 8
+ decompress-threads: 2
+ compress-level: 1 (which means best speed)
So, only the first two steps are required to use the multiple
thread compression in migration. You can do more if the default
diff --git a/docs/multiseat.txt b/docs/multiseat.txt
index 11850c96ff..2b297e979d 100644
--- a/docs/multiseat.txt
+++ b/docs/multiseat.txt
@@ -123,7 +123,7 @@ Background info is here:
guest side with pci-bridge-seat
-------------------------------
-Qemu version 2.4 and newer has a new pci-bridge-seat device which
+QEMU version 2.4 and newer has a new pci-bridge-seat device which
can be used instead of pci-bridge. Just swap the device name in the
qemu command line above. The only difference between the two devices
is the pci id. We can match the pci id instead of the device path
diff --git a/docs/papr-pef.txt b/docs/papr-pef.txt
deleted file mode 100644
index 72550e9bf8..0000000000
--- a/docs/papr-pef.txt
+++ /dev/null
@@ -1,30 +0,0 @@
-POWER (PAPR) Protected Execution Facility (PEF)
-===============================================
-
-Protected Execution Facility (PEF), also known as Secure Guest support
-is a feature found on IBM POWER9 and POWER10 processors.
-
-If a suitable firmware including an Ultravisor is installed, it adds
-an extra memory protection mode to the CPU. The ultravisor manages a
-pool of secure memory which cannot be accessed by the hypervisor.
-
-When this feature is enabled in QEMU, a guest can use ultracalls to
-enter "secure mode". This transfers most of its memory to secure
-memory, where it cannot be eavesdropped by a compromised hypervisor.
-
-Launching
----------
-
-To launch a guest which will be permitted to enter PEF secure mode:
-
-# ${QEMU} \
- -object pef-guest,id=pef0 \
- -machine confidential-guest-support=pef0 \
- ...
-
-Live Migration
-----------------
-
-Live migration is not yet implemented for PEF guests. For
-consistency, we currently prevent migration if the PEF feature is
-enabled, whether or not the guest has actually entered secure mode.
diff --git a/docs/pcie.txt b/docs/pcie.txt
index 89e3502075..df49178311 100644
--- a/docs/pcie.txt
+++ b/docs/pcie.txt
@@ -48,8 +48,8 @@ Place only the following kinds of devices directly on the Root Complex:
strangely when PCI Express devices are integrated
with the Root Complex.
- (2) PCI Express Root Ports (ioh3420), for starting exclusively PCI Express
- hierarchies.
+ (2) PCI Express Root Ports (pcie-root-port), for starting exclusively
+ PCI Express hierarchies.
(3) PCI Express to PCI Bridge (pcie-pci-bridge), for starting legacy PCI
hierarchies.
@@ -70,7 +70,7 @@ Place only the following kinds of devices directly on the Root Complex:
-device pxb-pcie,id=pcie.1,bus_nr=x[,numa_node=y][,addr=z]
PCI Express Root Ports and PCI Express to PCI bridges can be
connected to the pcie.1 bus:
- -device ioh3420,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z] \
+ -device pcie-root-port,id=root_port1[,bus=pcie.1][,chassis=x][,slot=y][,addr=z] \
-device pcie-pci-bridge,id=pcie_pci_bridge1,bus=pcie.1
@@ -112,14 +112,14 @@ Plug only PCI Express devices into PCI Express Ports.
------------
2.2.1 Plugging a PCI Express device into a PCI Express Root Port:
- -device ioh3420,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z] \
+ -device pcie-root-port,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z] \
-device <dev>,bus=root_port1
2.2.2 Using multi-function PCI Express Root Ports:
- -device ioh3420,id=root_port1,multifunction=on,chassis=x,addr=z.0[,slot=y][,bus=pcie.0] \
- -device ioh3420,id=root_port2,chassis=x1,addr=z.1[,slot=y1][,bus=pcie.0] \
- -device ioh3420,id=root_port3,chassis=x2,addr=z.2[,slot=y2][,bus=pcie.0] \
+ -device pcie-root-port,id=root_port1,multifunction=on,chassis=x,addr=z.0[,slot=y][,bus=pcie.0] \
+ -device pcie-root-port,id=root_port2,chassis=x1,addr=z.1[,slot=y1][,bus=pcie.0] \
+ -device pcie-root-port,id=root_port3,chassis=x2,addr=z.2[,slot=y2][,bus=pcie.0] \
2.2.3 Plugging a PCI Express device into a Switch:
- -device ioh3420,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z] \
+ -device pcie-root-port,id=root_port1,chassis=x,slot=y[,bus=pcie.0][,addr=z] \
-device x3130-upstream,id=upstream_port1,bus=root_port1[,addr=x] \
-device xio3130-downstream,id=downstream_port1,bus=upstream_port1,chassis=x1,slot=y1[,addr=z1] \
-device <dev>,bus=downstream_port1
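The multi-function Root Port layout from section 2.2.2 follows a regular pattern, so the `-device` arguments can be generated. The sketch below is illustrative only: the function name and the chassis numbering scheme are assumptions, while the device name `pcie-root-port` and the `multifunction`, `chassis`, `addr`, and `bus` options come from the text above.

```python
def multifunction_root_ports(slot_addr, count, bus="pcie.0", first_chassis=1):
    """Build -device arguments for `count` root ports sharing one PCI slot.

    Function 0 carries multifunction=on; every port gets its own chassis
    number (an assumption mirroring the x, x1, x2 placeholders in the text).
    """
    if not 1 <= count <= 8:
        raise ValueError("a PCI slot holds between 1 and 8 functions")
    args = []
    for fn in range(count):
        opts = ["pcie-root-port", f"id=root_port{fn + 1}"]
        if fn == 0:
            opts.append("multifunction=on")
        opts.append(f"chassis={first_chassis + fn}")
        opts.append(f"addr={slot_addr:#x}.{fn:#x}")
        opts.append(f"bus={bus}")
        args += ["-device", ",".join(opts)]
    return args


if __name__ == "__main__":
    # Three root ports as functions 0-2 of slot 2 on pcie.0.
    print(" \\\n".join(
        f"{flag} {value}"
        for flag, value in zip(*[iter(multifunction_root_ports(2, 3))] * 2)
    ))
```

The optional `slot` property is omitted here; add it per port if the guest needs stable slot numbers.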
diff --git a/docs/pcie_sriov.txt b/docs/pcie_sriov.txt
new file mode 100644
index 0000000000..a47aad0bfa
--- /dev/null
+++ b/docs/pcie_sriov.txt
@@ -0,0 +1,112 @@
+PCI SR/IOV EMULATION SUPPORT
+============================
+
+Description
+===========
+SR/IOV (Single Root I/O Virtualization) is an optional extended capability
+of a PCI Express device. It allows a single physical function (PF) to appear as multiple
+virtual functions (VFs) for the main purpose of eliminating software
+overhead in I/O from virtual machines.
+
+QEMU now implements the basic common functionality to enable an emulated device
+to support SR/IOV.
+
+Implementation
+==============
+Implementing emulation of an SR/IOV capable device typically consists of
+implementing support for two types of device classes; the "normal" physical device
+(PF) and the virtual device (VF). From QEMU's perspective, the VFs are just
+like other devices, except that some of their properties are derived from
+the PF.
+
+A virtual function is different from a physical function in that the BAR
+space for all VFs is defined by the BAR registers in the PF's SR/IOV
+capability. All VFs have the same BARs and BAR sizes.
+
+The address of an access to these virtual BARs is then computed as
+
+ <VF BAR start> + <VF number> * <BAR sz> + <offset>
+
+From our emulation perspective this means that there is a separate call for
+setting up a BAR for a VF.
+
+1) To enable SR/IOV support in the PF, it must be a PCI Express device so
+ you would need to add a PCI Express capability in the normal PCI
+ capability list. You might also want to add an ARI (Alternative
+ Routing-ID Interpretation) capability to indicate that your device
+ supports functions beyond its "own" function space (0-7),
+ which is necessary to support more than 7 functions, or
+ if functions extend beyond offset 7 because they are placed at an
+ offset > 1 or have stride > 1.
+
+ ...
+ #include "hw/pci/pcie.h"
+ #include "hw/pci/pcie_sriov.h"
+
+ pci_your_pf_dev_realize( ... )
+ {
+ ...
+ int ret = pcie_endpoint_cap_init(d, 0x70);
+ ...
+ pcie_ari_init(d, 0x100);
+ ...
+
+ /* Add and initialize the SR/IOV capability */
+ pcie_sriov_pf_init(d, 0x200, "your_virtual_dev",
+ vf_devid, initial_vfs, total_vfs,
+ fun_offset, stride);
+
+ /* Set up individual VF BARs (parameters as for normal BARs) */
+ pcie_sriov_pf_init_vf_bar( ... )
+ ...
+ }
+
+ For cleanup, you simply call:
+
+ pcie_sriov_pf_exit(device);
+
+ which will delete all the virtual functions and associated resources.
+
+2) Similarly in the implementation of the virtual function, you need to
+ make it a PCI Express device and add a similar set of capabilities
+ except for the SR/IOV capability. Then you need to set up the VF BARs as
+ subregions of the PF's SR/IOV VF BARs by calling
+ pcie_sriov_vf_register_bar() instead of the normal pci_register_bar() call:
+
+ pci_your_vf_dev_realize( ... )
+ {
+ ...
+ int ret = pcie_endpoint_cap_init(d, 0x60);
+ ...
+ pcie_ari_init(d, 0x100);
+ ...
+ memory_region_init(mr, ... )
+ pcie_sriov_vf_register_bar(d, bar_nr, mr);
+ ...
+ }
+
+Testing on Linux guest
+======================
+The easiest approach is to use a device driver that supports sysfs-based
+SR/IOV enabling. Support for this was added in kernel v3.8, so not all
+drivers support it yet.
+
+To enable 4 VFs for a device at 01:00.0:
+
+ modprobe yourdriver
+ echo 4 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+You should now see 4 VFs with lspci.
+To turn SR/IOV off again - the standard requires you to turn it off before you can enable
+another VF count, and the emulation enforces this:
+
+ echo 0 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+Older drivers typically provide a max_vfs module parameter
+to enable it at load time:
+
+ modprobe yourdriver max_vfs=4
+
+To disable the VFs again then, you simply have to unload the driver:
+
+ rmmod yourdriver
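The VF BAR address formula from the new pcie_sriov.txt, `<VF BAR start> + <VF number> * <BAR sz> + <offset>`, can be checked with a small worked example. This is a sketch rather than QEMU code: the function name is hypothetical, and it assumes the VF number is a zero-based index, which the text does not state explicitly.

```python
def vf_bar_address(vf_bar_start, vf_number, bar_size, offset):
    """Compute the address of an access to a VF's slice of the VF BAR.

    All VFs share the same BAR size, so VF n's slice begins n * bar_size
    bytes past the start programmed in the PF's SR/IOV capability.
    """
    if vf_number < 0:
        raise ValueError("VF number must be non-negative")
    if not 0 <= offset < bar_size:
        raise ValueError("offset must fall inside one VF's BAR")
    return vf_bar_start + vf_number * bar_size + offset


if __name__ == "__main__":
    # VF 2, 16 KiB BARs starting at 0x10000000, register at offset 0x10.
    print(hex(vf_bar_address(0x10000000, 2, 0x4000, 0x10)))
```

With these inputs the VF's slice starts 0x8000 bytes into the VF BAR region, and the access lands 0x10 bytes into that slice.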
diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
deleted file mode 100644
index 5c122fe818..0000000000
--- a/docs/pvrdma.txt
+++ /dev/null
@@ -1,345 +0,0 @@
-Paravirtualized RDMA Device (PVRDMA)
-====================================
-
-
-1. Description
-===============
-PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
-It works with its Linux Kernel driver AS IS, no need for any special guest
-modifications.
-
-While it complies with the VMware device, it can also communicate with bare
-metal RDMA-enabled machines as peers.
-
-It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
-
-It does not require the whole guest RAM to be pinned allowing memory
-over-commit and, even if not implemented yet, migration support will be
-possible with some HW assistance.
-
-A project presentation accompany this document:
-- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf
-
-
-
-2. Setup
-========
-
-
-2.1 Guest setup
-===============
-Fedora 27+ kernels work out of the box, older distributions
-require updating the kernel to 4.14 to include the pvrdma driver.
-
-However the libpvrdma library needed by User Level Software is still
-not available as part of the distributions, so the rdma-core library
-needs to be compiled and optionally installed.
-
-Please follow the instructions at:
- https://github.com/linux-rdma/rdma-core.git
-
-
-2.2 Host Setup
-==============
-The pvrdma backend is an ibdevice interface that can be exposed
-either by a Soft-RoCE(rxe) device on machines with no RDMA device,
-or an HCA SRIOV function(VF/PF).
-Note that ibdevice interfaces can't be shared between pvrdma devices,
-each one requiring a separate instance (rxe or SRIOV VF).
-
-
-2.2.1 Soft-RoCE backend(rxe)
-===========================
-A stable version of rxe is required, Fedora 27+ or a Linux
-Kernel 4.14+ is preferred.
-
-The rdma_rxe module is part of the Linux Kernel but not loaded by default.
-Install the User Level library (librxe) following the instructions from:
-https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
-
-Associate an ETH interface with rxe by running:
- rxe_cfg add eth0
-An rxe0 ibdevice interface will be created and can be used as pvrdma backend.
-
-
-2.2.2 RDMA device Virtual Function backend
-==========================================
-Nothing special is required, the pvrdma device can work not only with
-Ethernet Links, but also Infinibands Links.
-All is needed is an ibdevice with an active port, for Mellanox cards
-will be something like mlx5_6 which can be the backend.
-
-
-2.2.3 QEMU setup
-================
-Configure QEMU with --enable-rdma flag, installing
-the required RDMA libraries.
-
-
-
-3. Usage
-========
-
-
-3.1 VM Memory settings
-======================
-Currently the device is working only with memory backed RAM
-and it must be mark as "shared":
- -m 1G \
- -object memory-backend-ram,id=mb1,size=1G,share \
- -numa node,memdev=mb1 \
-
-
-3.2 MAD Multiplexer
-===================
-MAD Multiplexer is a service that exposes MAD-like interface for VMs in
-order to overcome the limitation where only single entity can register with
-MAD layer to send and receive RDMA-CM MAD packets.
-
-To build rdmacm-mux run
-# make rdmacm-mux
-
-Before running the rdmacm-mux make sure that both ib_cm and rdma_cm kernel
-modules aren't loaded, otherwise the rdmacm-mux service will fail to start.
-
-The application accepts 3 command line arguments and exposes a UNIX socket
-to pass control and data to it.
--d rdma-device-name Name of RDMA device to register with
--s unix-socket-path Path to unix socket to listen (default /var/run/rdmacm-mux)
--p rdma-device-port Port number of RDMA device to register with (default 1)
-The final UNIX socket file name is a concatenation of the 3 arguments so
-for example for device mlx5_0 on port 2 this /var/run/rdmacm-mux-mlx5_0-2
-will be created.
-
-pvrdma requires this service.
-
-Please refer to contrib/rdmacm-mux for more details.
-
-
-3.3 Service exposed by libvirt daemon
-=====================================
-The control over the RDMA device's GID table is done by updating the
-device's Ethernet function addresses.
-Usually the first GID entry is determined by the MAC address, the second by
-the first IPv6 address and the third by the IPv4 address. Other entries can
-be added by adding more IP addresses. The opposite is the same, i.e.
-whenever an address is removed, the corresponding GID entry is removed.
-The process is done by the network and RDMA stacks. Whenever an address is
-added the ib_core driver is notified and calls the device driver add_gid
-function which in turn update the device.
-To support this in pvrdma device the device hooks into the create_bind and
-destroy_bind HW commands triggered by pvrdma driver in guest.
-
-Whenever changed is made to the pvrdma port's GID table a special QMP
-messages is sent to be processed by libvirt to update the address of the
-backend Ethernet device.
-
-pvrdma requires that libvirt service will be up.
-
-
-3.4 PCI devices settings
-========================
-RoCE device exposes two functions - an Ethernet and RDMA.
-To support it, pvrdma device is composed of two PCI functions, an Ethernet
-device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI slot 1. The
-Ethernet function can be used for other Ethernet purposes such as IP.
-
-
-3.5 Device parameters
-=====================
-- netdev: Specifies the Ethernet device function name on the host for
- example enp175s0f0. For Soft-RoCE device (rxe) this would be the Ethernet
- device used to create it.
-- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
-- mad-chardev: The name of the MAD multiplexer char device.
-- ibport: In case of multi-port device (such as Mellanox's HCA) this
- specify the port to use. If not set 1 will be used.
-- dev-caps-max-mr-size: The maximum size of MR.
-- dev-caps-max-qp: Maximum number of QPs.
-- dev-caps-max-cq: Maximum number of CQs.
-- dev-caps-max-mr: Maximum number of MRs.
-- dev-caps-max-pd: Maximum number of PDs.
-- dev-caps-max-ah: Maximum number of AHs.
-
-Notes:
-- The first 3 parameters are mandatory settings, the rest have their
- defaults.
-- The last 8 parameters (the ones that prefixed by dev-caps) defines the top
- limits but the final values is adjusted by the backend device limitations.
-- netdev can be extracted from ibdev's sysfs
- (/sys/class/infiniband/<ibdev>/device/net/)
-
-
-3.6 Example
-===========
-Define bridge device with vmxnet3 network backend:
-<interface type='bridge'>
- <mac address='56:b4:44:e9:62:dc'/>
- <source bridge='bridge1'/>
- <model type='vmxnet3'/>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
-</interface>
-
-Define pvrdma device:
-<qemu:commandline>
- <qemu:arg value='-object'/>
- <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
- <qemu:arg value='-numa'/>
- <qemu:arg value='node,memdev=mb1'/>
- <qemu:arg value='-chardev'/>
- <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
- <qemu:arg value='-device'/>
- <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
-</qemu:commandline>
-
-
-
-4. Implementation details
-=========================
-
-
-4.1 Overview
-============
-The device acts like a proxy between the Guest Driver and the host
-ibdevice interface.
-On configuration path:
- - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
- a resource from the backend interface, maintaining a 1-1 mapping
- between the guest and host.
-On data path:
- - Every post_send/receive received from the guest will be converted into
- a post_send/receive for the backend. The buffers data will not be touched
- or copied resulting in near bare-metal performance for large enough buffers.
- - Completions from the backend interface will result in completions for
- the pvrdma device.
-
-
-4.2 PCI BARs
-============
-PCI Bars:
- BAR 0 - MSI-X
- MSI-X vectors:
- (0) Command - used when execution of a command is completed.
- (1) Async - not in use.
- (2) Completion - used when a completion event is placed in
- device's CQ ring.
- BAR 1 - Registers
- --------------------------------------------------------
- | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
- --------------------------------------------------------
- DSR - Address of driver/device shared memory used
- for the command channel, used for passing:
- - General info such as driver version
- - Address of 'command' and 'response'
- - Address of async ring
- - Address of device's CQ ring
- - Device capabilities
- CTL - Device control operations (activate, reset etc)
- IMG - Set interrupt mask
- REQ - Command execution register
- ERR - Operation status
-
- BAR 2 - UAR
- ---------------------------------------------------------
- | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
- ---------------------------------------------------------
- - Offset 0 used for QP operations (send and recv)
- - Offset 4 used for CQ operations (arm and poll)
-
-
-4.3 Major flows
-===============
-
-4.3.1 Create CQ
-===============
- - Guest driver
- - Allocates pages for CQ ring
- - Creates page directory (pdir) to hold CQ ring's pages
- - Initializes CQ ring
- - Initializes 'Create CQ' command object (cqe, pdir etc)
- - Copies the command to 'command' address
- - Writes 0 into REQ register
- - Device
- - Reads the request object from the 'command' address
- - Allocates CQ object and initialize CQ ring based on pdir
- - Creates the backend CQ
- - Writes operation status to ERR register
- - Posts command-interrupt to guest
- - Guest driver
- - Reads the HW response code from ERR register
-
-4.3.2 Create QP
-===============
- - Guest driver
- - Allocates pages for send and receive rings
- - Creates page directory(pdir) to hold the ring's pages
- - Initializes 'Create QP' command object (max_send_wr,
- send_cq_handle, recv_cq_handle, pdir etc)
- - Copies the object to 'command' address
- - Write 0 into REQ register
- - Device
- - Reads the request object from 'command' address
- - Allocates the QP object and initialize
- - Send and recv rings based on pdir
- - Send and recv ring state
- - Creates the backend QP
- - Writes the operation status to ERR register
- - Posts command-interrupt to guest
- - Guest driver
- - Reads the HW response code from ERR register
-
-4.3.3 Post receive
-==================
- - Guest driver
- - Initializes a wqe and place it on recv ring
- - Write to qpn|qp_recv_bit (31) to QP offset in UAR
- - Device
- - Extracts qpn from UAR
- - Walks through the ring and does the following for each wqe
- - Prepares the backend CQE context to be used when
- receiving completion from backend (wr_id, op_code, emu_cq_num)
- - For each sge prepares backend sge
- - Calls backend's post_recv
-
-4.3.4 Process backend events
-============================
- - Done by a dedicated thread used to process backend events;
- at initialization is attached to the device and creates
- the communication channel.
- - Thread main loop:
- - Polls for completions
- - Extracts QEMU _cq_num, wr_id and op_code from context
- - Writes CQE to CQ ring
- - Writes CQ number to device CQ
- - Sends completion-interrupt to guest
- - Deallocates context
- - Acks the event to backend
-
-
-
-5. Limitations
-==============
-- The device obviously is limited by the Guest Linux Driver features implementation
- of the VMware device API.
-- Memory registration mechanism requires mremap for every page in the buffer in order
- to map it to a contiguous virtual address range. Since this is not the data path
- it should not matter much. If the default max mr size is increased, be aware that
- memory registration can take up to 0.5 seconds for 1GB of memory.
-- The device requires target page size to be the same as the host page size,
- otherwise it will fail to init.
-- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached,
- so it can't work with huge pages. The limitation will be addressed in the future,
- however QEMU allocates Guest RAM with MADV_HUGEPAGE so if there are enough huge
- pages available, QEMU will use them. QEMU will fail to init if the requirements
- are not met.
-
-
-
-6. Performance
-==============
-By design the pvrdma device exits on each post-send/receive, so for small buffers
-the performance is affected; however for medium buffers it will became close to
-bare metal and from 1MB buffers and up it reaches bare metal performance.
-(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)
-
-All the above assumes no memory registration is done on data path.
diff --git a/docs/qdev-device-use.txt b/docs/qdev-device-use.txt
index 2408889334..c98c86d828 100644
--- a/docs/qdev-device-use.txt
+++ b/docs/qdev-device-use.txt
@@ -216,11 +216,11 @@ LEGACY-CHARDEV translates to -chardev HOST-OPTS... as follows:
* unix:FNAME becomes -chardev socket,path=FNAME
-* /dev/parportN becomes -chardev parport,file=/dev/parportN
+* /dev/parportN becomes -chardev parallel,file=/dev/parportN
* /dev/ppiN likewise
-* Any other /dev/FNAME becomes -chardev tty,path=/dev/FNAME
+* Any other /dev/FNAME becomes -chardev serial,path=/dev/FNAME
* mon:LEGACY-CHARDEV is special: it multiplexes the monitor onto the
character device defined by LEGACY-CHARDEV. -chardev provides more
diff --git a/docs/qemu_logo.pdf b/docs/qemu_logo.pdf
deleted file mode 100644
index 294cb7dec5..0000000000
--- a/docs/qemu_logo.pdf
+++ /dev/null
Binary files differ
diff --git a/docs/rdma.txt b/docs/rdma.txt
index 2b4cdea1d8..bd8dd799a9 100644
--- a/docs/rdma.txt
+++ b/docs/rdma.txt
@@ -89,7 +89,7 @@ RUNNING:
First, set the migration speed to match your hardware's capabilities:
QEMU Monitor Command:
-$ migrate_set_parameter max_bandwidth 40g # or whatever is the MAX of your RDMA device
+$ migrate_set_parameter max-bandwidth 40g # or whatever is the MAX of your RDMA device
Next, on the destination machine, add the following to the QEMU command line:
diff --git a/docs/replay.txt b/docs/replay.txt
deleted file mode 100644
index 5b008ca491..0000000000
--- a/docs/replay.txt
+++ /dev/null
@@ -1,410 +0,0 @@
-Copyright (c) 2010-2015 Institute for System Programming
- of the Russian Academy of Sciences.
-
-This work is licensed under the terms of the GNU GPL, version 2 or later.
-See the COPYING file in the top-level directory.
-
-Record/replay
--------------
-
-Record/replay functions are used for the deterministic replay of qemu execution.
-Execution recording writes a non-deterministic events log, which can be later
-used for replaying the execution anywhere and for unlimited number of times.
-It also supports checkpointing for faster rewind to the specific replay moment.
-Execution replaying reads the log and replays all non-deterministic events
-including external input, hardware clocks, and interrupts.
-
-Deterministic replay has the following features:
- * Deterministically replays whole system execution and all contents of
- the memory, state of the hardware devices, clocks, and screen of the VM.
- * Writes execution log into the file for later replaying for multiple times
- on different machines.
- * Supports i386, x86_64, and Arm hardware platforms.
- * Performs deterministic replay of all operations with keyboard and mouse
- input devices.
-
-Usage of the record/replay:
- * First, record the execution with the following command line:
- qemu-system-i386 \
- -icount shift=7,rr=record,rrfile=replay.bin \
- -drive file=disk.qcow2,if=none,snapshot,id=img-direct \
- -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
- -device ide-hd,drive=img-blkreplay \
- -netdev user,id=net1 -device rtl8139,netdev=net1 \
- -object filter-replay,id=replay,netdev=net1
- * After recording, you can replay it by using another command line:
- qemu-system-i386 \
- -icount shift=7,rr=replay,rrfile=replay.bin \
- -drive file=disk.qcow2,if=none,snapshot,id=img-direct \
- -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \
- -device ide-hd,drive=img-blkreplay \
- -netdev user,id=net1 -device rtl8139,netdev=net1 \
- -object filter-replay,id=replay,netdev=net1
- The only difference with recording is changing the rr option
- from record to replay.
- * Block device images are not actually changed in the recording mode,
- because all of the changes are written to the temporary overlay file.
- This behavior is enabled by using blkreplay driver. It should be used
- for every enabled block device, as described in 'Block devices' section.
- * '-net none' option should be specified when network is not used,
- because QEMU adds network card by default. When network is needed,
- it should be configured explicitly with replay filter, as described
- in 'Network devices' section.
- * Interaction with audio devices and serial ports are recorded and replayed
- automatically when such devices are enabled.
-
-Academic papers with description of deterministic replay implementation:
-http://www.computer.org/csdl/proceedings/csmr/2012/4666/00/4666a553-abs.html
-http://dl.acm.org/citation.cfm?id=2786805.2803179
-
-Modifications of qemu include:
- * wrappers for clock and time functions to save their return values in the log
- * saving different asynchronous events (e.g. system shutdown) into the log
- * synchronization of the bottom halves execution
- * synchronization of the threads from thread pool
- * recording/replaying user input (mouse, keyboard, and microphone)
- * adding internal checkpoints for cpu and io synchronization
- * network filter for recording and replaying the packets
- * block driver for making block layer deterministic
- * serial port input record and replay
- * recording of random numbers obtained from the external sources
-
-Locking and thread synchronisation
-----------------------------------
-
-Previously the synchronisation of the main thread and the vCPU thread
-was ensured by the holding of the BQL. However the trend has been to
-reduce the time the BQL was held across the system including under TCG
-system emulation. As it is important that batches of events are kept
-in sequence (e.g. expiring timers and checkpoints in the main thread
-while instruction checkpoints are written by the vCPU thread) we need
-another lock to keep things in lock-step. This role is now handled by
-the replay_mutex_lock. It used to be held only for each event being
-written but now it is held for a whole execution period. This results
-in a deterministic ping-pong between the two main threads.
-
-As the BQL is now a finer grained lock than the replay_lock it is almost
-certainly a bug, and a source of deadlocks, to take the
-replay_mutex_lock while the BQL is held. This is enforced by an assert.
-While the unlocks are usually in the reverse order, this is not
-necessary; you can drop the replay_lock while holding the BQL, without
-doing a more complicated unlock_iothread/replay_unlock/lock_iothread
-sequence.
-
-Non-deterministic events
-------------------------
-
-Our record/replay system is based on saving and replaying non-deterministic
-events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
-from HDD or memory of the VM). Saving only non-deterministic events makes
-log file smaller and simulation faster.
-
-The following non-deterministic data from peripheral devices is saved into
-the log: mouse and keyboard input, network packets, audio controller input,
-serial port input, and hardware clocks (they are non-deterministic
-too, because their values are taken from the host machine). Inputs from
-simulated hardware, memory of VM, software interrupts, and execution of
-instructions are not saved into the log, because they are deterministic and
-can be replayed by simulating the behavior of virtual machine starting from
-initial state.
-
-We had to solve three tasks to implement deterministic replay: recording
-non-deterministic events, replaying non-deterministic events, and checking
-that there is no divergence between record and replay modes.
-
-We changed several parts of QEMU to make event log recording and replaying.
-Devices' models that have non-deterministic input from external devices were
-changed to write every external event into the execution log immediately.
-E.g. network packets are written into the log when they arrive into the virtual
-network adapter.
-
-All non-deterministic events are coming from these devices. But to
-replay them we need to know at which moments they occur. We specify
-these moments by counting the number of instructions executed between
-every pair of consecutive events.
-
-Instruction counting
---------------------
-
-QEMU should work in icount mode to use record/replay feature. icount was
-designed to allow deterministic execution in absence of external inputs
-of the virtual machine. We also use icount to control the occurrence of the
-non-deterministic events. The number of instructions elapsed from the last event
-is written to the log while recording the execution. In replay mode we
-can predict when to inject that event using the instruction counter.
-
-Timers
-------
-
-Timers are used to execute callbacks from different subsystems of QEMU
-at the specified moments of time. There are several kinds of timers:
- * Real time clock. Based on host time and used only for callbacks that
- do not change the virtual machine state. For this reason real time
- clock and timers does not affect deterministic replay at all.
- * Virtual clock. These timers run only during the emulation. In icount
- mode virtual clock value is calculated using executed instructions counter.
- That is why it is completely deterministic and does not have to be recorded.
- * Host clock. This clock is used by device models that simulate real time
- sources (e.g. real time clock chip). Host clock is the one of the sources
- of non-determinism. Host clock read operations should be logged to
- make the execution deterministic.
- * Virtual real time clock. This clock is similar to real time clock but
- it is used only for increasing virtual clock while virtual machine is
- sleeping. Due to its nature it is also non-deterministic as the host clock
- and has to be logged too.
-
-Checkpoints
------------
-
-Replaying of the execution of virtual machine is bound by sources of
-non-determinism. These are inputs from clock and peripheral devices,
-and QEMU thread scheduling. Thread scheduling affect on processing events
-from timers, asynchronous input-output, and bottom halves.
-
-Invocations of timers are coupled with clock reads and changing the state
-of the virtual machine. Reads produce non-deterministic data taken from
-host clock. And VM state changes should preserve their order. Their relative
-order in replay mode must replicate the order of callbacks in record mode.
-To preserve this order we use checkpoints. When a specific clock is processed
-in record mode we save to the log special "checkpoint" event.
-Checkpoints here do not refer to virtual machine snapshots. They are just
-record/replay events used for synchronization.
-
-QEMU in replay mode will try to invoke timers processing in random moment
-of time. That's why we do not process a group of timers until the checkpoint
-event will be read from the log. Such an event allows synchronizing CPU
-execution and timer events.
-
-Two other checkpoints govern the "warping" of the virtual clock.
-While the virtual machine is idle, the virtual clock increments at
-1 ns per *real time* nanosecond. This is done by setting up a timer
-(called the warp timer) on the virtual real time clock, so that the
-timer fires at the next deadline of the virtual clock; the virtual clock
-is then incremented (which is called "warping" the virtual clock) as
-soon as the timer fires or the CPUs need to go out of the idle state.
-Two functions are used for this purpose; because these actions change
-virtual machine state and must be deterministic, each of them creates a
-checkpoint. icount_start_warp_timer checks if the CPUs are idle and if so
-starts accounting real time to virtual clock. icount_account_warp_timer
-is called when the CPUs get an interrupt or when the warp timer fires,
-and it warps the virtual clock by the amount of real time that has passed
-since icount_start_warp_timer.
-
-Bottom halves
--------------
-
-Disk I/O events are completely deterministic in our model, because
-in both record and replay modes we start virtual machine from the same
-disk state. But callbacks that virtual disk controller uses for reading and
-writing the disk may occur at different moments of time in record and replay
-modes.
-
-Reading and writing requests are created by CPU thread of QEMU. Later these
-requests proceed to block layer which creates "bottom halves". Bottom
-halves consist of callback and its parameters. They are processed when
-main loop locks the global mutex. These locks are not synchronized with
-replaying process because main loop also processes the events that do not
-affect the virtual machine state (like user interaction with monitor).
-
-That is why we had to implement saving and replaying bottom halves callbacks
-synchronously to the CPU execution. When the callback is about to execute
-it is added to the queue in the replay module. This queue is written to the
-log when its callbacks are executed. In replay mode callbacks are not processed
-until the corresponding event is read from the events log file.
-
-Sometimes the block layer uses asynchronous callbacks for its internal purposes
-(like reading or writing VM snapshots or disk image cluster tables). In this
-case bottom halves are not marked as "replayable" and do not saved
-into the log.
-
-Block devices
--------------
-
-Block devices record/replay module intercepts calls of
-bdrv coroutine functions at the top of block drivers stack.
-To record and replay block operations the drive must be configured
-as following:
- -drive file=disk.qcow2,if=none,snapshot,id=img-direct
- -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
- -device ide-hd,drive=img-blkreplay
-
-blkreplay driver should be inserted between disk image and virtual driver
-controller. Therefore all disk requests may be recorded and replayed.
-
-All block completion operations are added to the queue in the coroutines.
-Queue is flushed at checkpoints and information about processed requests
-is recorded to the log. In replay phase the queue is matched with
-events read from the log. Therefore block devices requests are processed
-deterministically.
-
-Snapshotting
-------------
-
-New VM snapshots may be created in replay mode. They can be used later
-to recover the desired VM state. All VM states created in replay mode
-are associated with the moment of time in the replay scenario.
-After recovering the VM state replay will start from that position.
-
-Default starting snapshot name may be specified with icount field
-rrsnapshot as follows:
- -icount shift=7,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name
-
-This snapshot is created at start of recording and restored at start
-of replaying. It also can be loaded while replaying to roll back
-the execution.
-
-'snapshot' flag of the disk image must be removed to save the snapshots
-in the overlay (or original image) instead of using the temporary overlay.
- -drive file=disk.ovl,if=none,id=img-direct
- -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
- -device ide-hd,drive=img-blkreplay
-
-Use QEMU monitor to create additional snapshots. 'savevm <name>' command
-created the snapshot and 'loadvm <name>' restores it. To prevent corruption
-of the original disk image, use overlay files linked to the original images.
-Therefore all new snapshots (including the starting one) will be saved in
-overlays and the original image remains unchanged.
-
-When you need to use snapshots with diskless virtual machine,
-it must be started with 'orphan' qcow2 image. This image will be used
-for storing VM snapshots. Here is the example of the command line for this:
-
- qemu-system-i386 -icount shift=3,rr=replay,rrfile=record.bin,rrsnapshot=init \
- -net none -drive file=empty.qcow2,if=none,id=rr
-
-empty.qcow2 drive does not connected to any virtual block device and used
-for VM snapshots only.
-
-Network devices
----------------
-
-Record and replay for network interactions is performed with the network filter.
-Each backend must have its own instance of the replay filter as follows:
- -netdev user,id=net1 -device rtl8139,netdev=net1
- -object filter-replay,id=replay,netdev=net1
-
-Replay network filter is used to record and replay network packets. While
-recording the virtual machine this filter puts all packets coming from
-the outer world into the log. In replay mode packets from the log are
-injected into the network device. All interactions with network backend
-in replay mode are disabled.
-
-Audio devices
--------------
-
-Audio data is recorded and replay automatically. The command line for recording
-and replaying must contain identical specifications of audio hardware, e.g.:
- -soundhw ac97
-
-Serial ports
-------------
-
-Serial ports input is recorded and replay automatically. The command lines
-for recording and replaying must contain identical number of ports in record
-and replay modes, but their backends may differ.
-E.g., '-serial stdio' in record mode, and '-serial null' in replay mode.
-
-Reverse debugging
------------------
-
-Reverse debugging allows "executing" the program in reverse direction.
-GDB remote protocol supports "reverse step" and "reverse continue"
-commands. The first one steps single instruction backwards in time,
-and the second one finds the last breakpoint in the past.
-
-Recorded executions may be used to enable reverse debugging. QEMU can't
-execute the code in backwards direction, but can load a snapshot and
-replay forward to find the desired position or breakpoint.
-
-The following GDB commands are supported:
- - reverse-stepi (or rsi) - step one instruction backwards
- - reverse-continue (or rc) - find last breakpoint in the past
-
-Reverse step loads the nearest snapshot and replays the execution until
-the required instruction is met.
-
-Reverse continue may include several passes of examining the execution
-between the snapshots. Each of the passes include the following steps:
- 1. loading the snapshot
- 2. replaying to examine the breakpoints
- 3. if breakpoint or watchpoint was met
- - loading the snapshot again
- - replaying to the required breakpoint
- 4. else
- - proceeding to the p.1 with the earlier snapshot
-
-Therefore usage of the reverse debugging requires at least one snapshot
-created in advance. This can be done by omitting 'snapshot' option
-for the block drives and adding 'rrsnapshot' for both record and replay
-command lines.
-See the "Snapshotting" section to learn more about running record/replay
-and creating the snapshot in these modes.
-
-Replay log format
------------------
-
-Record/replay log consists of the header and the sequence of execution
-events. The header includes 4-byte replay version id and 8-byte reserved
-field. Version is updated every time replay log format changes to prevent
-using replay log created by another build of qemu.
-
-The sequence of the events describes virtual machine state changes.
-It includes all non-deterministic inputs of VM, synchronization marks and
-instruction counts used to correctly inject inputs at replay.
-
-Synchronization marks (checkpoints) are used for synchronizing qemu threads
-that perform operations with virtual hardware. These operations may change
-system's state (e.g., change some register or generate interrupt) and
-therefore should execute synchronously with CPU thread.
-
-Every event in the log includes 1-byte event id and optional arguments.
-When argument is an array, it is stored as 4-byte array length
-and corresponding number of bytes with data.
-Here is the list of events that are written into the log:
-
- - EVENT_INSTRUCTION. Instructions executed since last event.
- Argument: 4-byte number of executed instructions.
- - EVENT_INTERRUPT. Used to synchronize interrupt processing.
- - EVENT_EXCEPTION. Used to synchronize exception handling.
- - EVENT_ASYNC. This is a group of events. They are always processed
- together with checkpoints. When such an event is generated, it is
- stored in the queue and processed only when checkpoint occurs.
- Every such event is followed by 1-byte checkpoint id and 1-byte
- async event id from the following list:
- - REPLAY_ASYNC_EVENT_BH. Bottom-half callback. This event synchronizes
- callbacks that affect virtual machine state, but normally called
- asynchronously.
- Argument: 8-byte operation id.
- - REPLAY_ASYNC_EVENT_INPUT. Input device event. Contains
- parameters of keyboard and mouse input operations
- (key press/release, mouse pointer movement).
- Arguments: 9-16 bytes depending of input event.
- - REPLAY_ASYNC_EVENT_INPUT_SYNC. Internal input synchronization event.
- - REPLAY_ASYNC_EVENT_CHAR_READ. Character (e.g., serial port) device input
- initiated by the sender.
- Arguments: 1-byte character device id.
- Array with bytes were read.
- - REPLAY_ASYNC_EVENT_BLOCK. Block device operation. Used to synchronize
- operations with disk and flash drives with CPU.
- Argument: 8-byte operation id.
- - REPLAY_ASYNC_EVENT_NET. Incoming network packet.
- Arguments: 1-byte network adapter id.
- 4-byte packet flags.
- Array with packet bytes.
- - EVENT_SHUTDOWN. Occurs when user sends shutdown event to qemu,
- e.g., by closing the window.
- - EVENT_CHAR_WRITE. Used to synchronize character output operations.
- Arguments: 4-byte output function return value.
- 4-byte offset in the output array.
- - EVENT_CHAR_READ_ALL. Used to synchronize character input operations,
- initiated by qemu.
- Argument: Array with bytes that were read.
- - EVENT_CHAR_READ_ALL_ERROR. Unsuccessful character input operation,
- initiated by qemu.
- Argument: 4-byte error code.
- - EVENT_CLOCK + clock_id. Group of events for host clock read operations.
- Argument: 8-byte clock value.
- - EVENT_CHECKPOINT + checkpoint_id. Checkpoint for synchronization of
- CPU, internal threads, and asynchronous input events. May be followed
- by one or more EVENT_ASYNC events.
- - EVENT_END. Last event in the log.
diff --git a/docs/requirements.txt b/docs/requirements.txt
new file mode 100644
index 0000000000..02583f209a
--- /dev/null
+++ b/docs/requirements.txt
@@ -0,0 +1,5 @@
+# Used by readthedocs.io
+# Should be in sync with the "installed" key of pythondeps.toml
+
+sphinx==5.3.0
+sphinx_rtd_theme==1.1.1
diff --git a/docs/specs/acpi_erst.rst b/docs/specs/acpi_erst.rst
new file mode 100644
index 0000000000..2339b60ad7
--- /dev/null
+++ b/docs/specs/acpi_erst.rst
@@ -0,0 +1,200 @@
+ACPI ERST DEVICE
+================
+
+The ACPI ERST device is utilized to support the ACPI Error Record
+Serialization Table, ERST, functionality. This feature is designed for
+storing error records in persistent storage for future reference
+and/or debugging.
+
+The ACPI specification[1], in Chapter "ACPI Platform Error Interfaces
+(APEI)", and specifically subsection "Error Serialization", outlines a
+method for storing error records into persistent storage.
+
+The format of error records is described in the UEFI specification[2],
+in Appendix N "Common Platform Error Record".
+
+While the ACPI specification allows for an NVRAM "mode" (see
+GET_ERROR_LOG_ADDRESS_RANGE_ATTRIBUTES) where non-volatile RAM is
+directly exposed for direct access by the OS/guest, this device
+implements the non-NVRAM "mode". This non-NVRAM "mode" is what is
+implemented by most BIOS (since flash memory requires programming
+operations in order to update its contents). Furthermore, as of the
+time of this writing, Linux only supports the non-NVRAM "mode".
+
+
+Background/Motivation
+---------------------
+
+Linux uses the persistent storage filesystem, pstore, to record
+information (e.g. dmesg tail) upon panics and shutdowns. Pstore is
+independent of, and runs before, kdump. In certain scenarios (i.e.
+hosts/guests with root filesystems on NFS/iSCSI where networking
+software and/or hardware fails, and thus kdump fails), pstore may
+contain information available for post-mortem debugging.
+
+Two common storage backends for the pstore filesystem are ACPI ERST
+and UEFI. Most BIOS implement ACPI ERST. UEFI is not utilized in all
+guests. With QEMU supporting ACPI ERST, it becomes a viable pstore
+storage backend for virtual machines (as it is now for bare metal
+machines).
+
+Enabling support for ACPI ERST facilitates a consistent method to
+capture kernel panic information in a wide range of guests: from
+resource-constrained microvms to very large guests, and in particular,
+in direct-boot environments (which would lack UEFI run-time services).
+
+Note that Microsoft Windows also utilizes the ACPI ERST for certain
+crash information, if available[3].
+
+
+Configuration|Usage
+-------------------
+
+To use ACPI ERST, a memory-backend-file object and acpi-erst device
+can be created, for example::
+
+ qemu ... \
+ -object memory-backend-file,id=erstnvram,mem-path=acpi-erst.backing,size=0x10000,share=on \
+ -device acpi-erst,memdev=erstnvram
+
+For proper operation, the ACPI ERST device needs a memory-backend-file
+object with the following parameters:
+
+ - id: The id of the memory-backend-file object is used to associate
+ this memory with the acpi-erst device.
+ - size: The size of the ACPI ERST backing storage. This parameter is
+ required.
+ - mem-path: The location of the ACPI ERST backing storage file. This
+ parameter is also required.
+ - share: The share=on parameter is required so that updates to the
+ ERST backing store are written to the file.
+
+and an acpi-erst device with the following parameters:
+
+ - memdev: The object id of the memory-backend-file object.
+ - record_size: Specifies the size of the records (or slots) in the
+ backend storage. Must be a power of two value greater than or
+ equal to 4096 (PAGE_SIZE).
+
+
+PCI Interface
+-------------
+
+The ERST device is a PCI device with two BARs, one for accessing the
+programming registers, and the other for accessing the record exchange
+buffer.
+
+BAR0 contains the programming interface consisting of ACTION and VALUE
+64-bit registers. All ERST actions/operations/side effects happen on
+the write to the ACTION, by design. Any data needed by the action must
+be placed into VALUE prior to writing ACTION. Reading the VALUE
+simply returns the register contents, which can be updated by a
+previous ACTION.
+
+BAR1 contains the 8KiB record exchange buffer; 8KiB is also the maximum
+record size that the device implements.
+
+
+Backend Storage Format
+----------------------
+
+The backend storage is divided into fixed size "slots", 8KiB in
+length, with each slot storing a single record. Not all slots need to
+be occupied, and they need not be occupied in a contiguous fashion.
+The ability to clear/erase specific records allows for the formation
+of unoccupied slots.
+
+Slot 0 contains a backend storage header that identifies the contents
+as ERST and also facilitates efficient access to the records.
+Depending upon the size of the backend storage, additional slots will
+be designated to be a part of the slot 0 header. For example, at 8KiB,
+the slot 0 header can accommodate 1021 records. Thus a storage size
+of 8MiB (8KiB * 1024) requires an additional slot for use by the
+header. In this scenario, slot 0 and slot 1 form the backend storage
+header, and records can be stored starting at slot 2.
+
+Below is an example layout of the backend storage format (for storage
+size less than 8MiB). The size of the storage is a multiple of 8KiB,
+and contains N slots to store records. The example below
+shows two records (in CPER format) in the backend storage, while the
+remaining slots are empty/available.
+
+::
+
+ Slot Record
+ <------------------ 8KiB -------------------->
+ +--------------------------------------------+
+ 0 | storage header |
+ +--------------------------------------------+
+ 1 | empty/available |
+ +--------------------------------------------+
+ 2 | CPER |
+ +--------------------------------------------+
+ 3 | CPER |
+ +--------------------------------------------+
+ ... | |
+ +--------------------------------------------+
+ N | empty/available |
+ +--------------------------------------------+
+
+The storage header consists of some basic information and an array
+of CPER record_id's to efficiently access records in the backend
+storage.
+
+All fields in the header are stored in little endian format.
+
+::
+
+ +--------------------------------------------+
+ | magic | 0x0000
+ +--------------------------------------------+
+ | record_offset | record_size | 0x0008
+ +--------------------------------------------+
+ | record_count | reserved | version | 0x0010
+ +--------------------------------------------+
+ | record_id[0] | 0x0018
+ +--------------------------------------------+
+ | record_id[1] | 0x0020
+ +--------------------------------------------+
+ | record_id[...] |
+ +--------------------------------------------+
+ | record_id[N] | 0x1FF8
+ +--------------------------------------------+
+
+The 'magic' field contains the value 0x524F545354535245.
+
+The 'record_size' field contains the value 0x2000, 8KiB.
+
+The 'record_offset' field points to the first record_id in the array,
+0x0018.
+
+The 'version' field contains 0x0100, the first version.
+
+The 'record_count' field contains the number of valid records in the
+backend storage.
+
+The 'record_id' array fields are the 64-bit record identifiers of the
+CPER record in the corresponding slot. Stated differently, the
+location of a CPER record_id in the record_id[] array provides the
+slot index for the corresponding record in the backend storage.
+
+Note that, for example, with a backend storage less than 8MiB, slot 0
+contains the header, so the record_id[0] will never contain a valid
+CPER record_id. Instead slot 1 is the first available slot and thus
+record_id[1] may contain a CPER record_id.
+
+A 'record_id' of all 0s or all 1s indicates an invalid record (i.e. the
+slot is available).
+
+
+References
+----------
+
+[1] "Advanced Configuration and Power Interface Specification",
+ version 4.0, June 2009.
+
+[2] "Unified Extensible Firmware Interface Specification",
+ version 2.1, October 2008.
+
+[3] "Windows Hardware Error Architecture", specifically
+ "Error Record Persistence Mechanism".
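Note on the storage header described in docs/specs/acpi_erst.rst above: the layout can be sketched in a few lines of Python. This is an illustrative sketch, not code from QEMU. It assumes each 8-byte row of the header diagram is a little-endian word drawn with its most-significant field on the left (so record_size sits at 0x08, record_offset at 0x0C, version at 0x10, and record_count at 0x14). The names parse_header and occupied_slots are hypothetical.

```python
import struct

# Magic from the document; stored little-endian it reads as b"ERSTSTOR".
ERST_MAGIC = 0x524F545354535245

# Assumed field order in memory (see the lead-in):
# magic, record_size, record_offset, version, reserved, record_count
HEADER_FMT = "<QIIHHI"
HEADER_LEN = struct.calcsize(HEADER_FMT)  # 24 bytes, i.e. 0x18

INVALID_IDS = (0, 0xFFFFFFFFFFFFFFFF)  # all 0s or all 1s: slot is available


def parse_header(blob):
    """Decode the fixed part of the backend storage header."""
    (magic, record_size, record_offset,
     version, _reserved, record_count) = struct.unpack_from(HEADER_FMT, blob, 0)
    if magic != ERST_MAGIC:
        raise ValueError("not an ERST backing store")
    return {
        "record_size": record_size,
        "record_offset": record_offset,
        "version": version,
        "record_count": record_count,
    }


def occupied_slots(blob, record_size=0x2000):
    """Return {slot_index: record_id} for the valid ids held in slot 0."""
    entries = (record_size - HEADER_LEN) // 8  # 64-bit ids that fit in slot 0
    ids = struct.unpack_from("<%dQ" % entries, blob, HEADER_LEN)
    return {i: rid for i, rid in enumerate(ids) if rid not in INVALID_IDS}
```

With the 8KiB slot size, (0x2000 - 0x18) / 8 = 1021 record_id entries fit in slot 0, which matches the figure quoted in the text.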
diff --git a/docs/specs/edu.txt b/docs/specs/edu.rst
index 0876310809..ae72737dbb 100644
--- a/docs/specs/edu.txt
+++ b/docs/specs/edu.rst
@@ -2,9 +2,10 @@
EDU device
==========
-Copyright (c) 2014-2015 Jiri Slaby
+..
+ Copyright (c) 2014-2015 Jiri Slaby
-This document is licensed under the GPLv2 (or later).
+ This document is licensed under the GPLv2 (or later).
This is an educational device for writing (kernel) drivers. Its original
intention was to support the Linux kernel lectures taught at the Masaryk
@@ -15,10 +16,11 @@ The devices behaves very similar to the PCI bridge present in the COMBO6 cards
developed under the Liberouter wings. Both PCI device ID and PCI space is
inherited from that device.
-Command line switches:
- -device edu[,dma_mask=mask]
+Command line switches
+---------------------
- dma_mask makes the virtual device work with DMA addresses with the given
+``-device edu[,dma_mask=mask]``
+ ``dma_mask`` makes the virtual device work with DMA addresses with the given
mask. For educational purposes, the device supports only 28 bits (256 MiB)
by default. Students shall set dma_mask for the device in the OS driver
properly.
@@ -26,7 +28,8 @@ Command line switches:
PCI specs
---------
-PCI ID: 1234:11e8
+PCI ID:
+ ``1234:11e8``
PCI Region 0:
I/O memory, 1 MB in size. Users are supposed to communicate with the card
@@ -35,24 +38,29 @@ PCI Region 0:
MMIO area spec
--------------
-Only size == 4 accesses are allowed for addresses < 0x80. size == 4 or
-size == 8 for the rest.
+Only ``size == 4`` accesses are allowed for addresses ``< 0x80``.
+``size == 4`` or ``size == 8`` for the rest.
-0x00 (RO) : identification (0xRRrr00edu)
- RR -- major version
- rr -- minor version
+0x00 (RO) : identification
+ Value is in the form ``0xRRrr00edu`` where:
+ - ``RR`` -- major version
+ - ``rr`` -- minor version
0x04 (RW) : card liveness check
- It is a simple value inversion (~ C operator).
+ It is a simple value inversion (``~`` C operator).
0x08 (RW) : factorial computation
The stored value is taken and factorial of it is put back here.
This happens only after factorial bit in the status register (0x20
below) is cleared.
-0x20 (RW) : status register, bitwise OR
- 0x01 -- computing factorial (RO)
- 0x80 -- raise interrupt after finishing factorial computation
+0x20 (RW) : status register
+ Bitwise OR of:
+
+ 0x01
+ computing factorial (RO)
+ 0x80
+ raise interrupt after finishing factorial computation
0x24 (RO) : interrupt status register
It contains values which raised the interrupt (see interrupt raise
@@ -76,13 +84,19 @@ size == 8 for the rest.
0x90 (RW) : DMA transfer count
The size of the area to perform the DMA on.
-0x98 (RW) : DMA command register, bitwise OR
- 0x01 -- start transfer
- 0x02 -- direction (0: from RAM to EDU, 1: from EDU to RAM)
- 0x04 -- raise interrupt 0x100 after finishing the DMA
+0x98 (RW) : DMA command register
+ Bitwise OR of:
+
+ 0x01
+ start transfer
+ 0x02
+ direction (0: from RAM to EDU, 1: from EDU to RAM)
+ 0x04
+ raise interrupt 0x100 after finishing the DMA
IRQ controller
--------------
+
An IRQ is generated when written to the interrupt raise register. The value
appears in interrupt status register when the interrupt is raised and has to
be written to the interrupt acknowledge register to lower it.
@@ -94,22 +108,28 @@ routine.
DMA controller
--------------
+
One has to specify, source, destination, size, and start the transfer. One
4096 bytes long buffer at offset 0x40000 is available in the EDU device. I.e.
one can perform DMA to/from this space when programmed properly.
Example of transferring a 100 byte block to and from the buffer using a given
-PCI address 'addr':
-addr -> DMA source address
-0x40000 -> DMA destination address
-100 -> DMA transfer count
-1 -> DMA command register
-while (DMA command register & 1)
- ;
-
-0x40000 -> DMA source address
-addr+100 -> DMA destination address
-100 -> DMA transfer count
-3 -> DMA command register
-while (DMA command register & 1)
- ;
+PCI address ``addr``:
+
+::
+
+ addr -> DMA source address
+ 0x40000 -> DMA destination address
+ 100 -> DMA transfer count
+ 1 -> DMA command register
+ while (DMA command register & 1)
+ ;
+
+::
+
+ 0x40000 -> DMA source address
+ addr+100 -> DMA destination address
+ 100 -> DMA transfer count
+ 3 -> DMA command register
+ while (DMA command register & 1)
+ ;
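The two DMA command values used in the edu.rst example above (1 and 3) follow from the command register bits listed for offset 0x98. A small Python sketch of that composition; the constant and function names are hypothetical, and only the bit values come from the document.

```python
# Bits of the DMA command register at offset 0x98, as listed in edu.rst.
DMA_START = 0x01   # start transfer
DMA_TO_RAM = 0x02  # direction: 0 = from RAM to EDU, 1 = from EDU to RAM
DMA_IRQ = 0x04     # raise interrupt 0x100 after finishing the DMA


def dma_command(to_ram, raise_irq=False):
    """Build the value to write to the DMA command register."""
    cmd = DMA_START
    if to_ram:
        cmd |= DMA_TO_RAM
    if raise_irq:
        cmd |= DMA_IRQ
    return cmd
```

Here dma_command(False) yields 1 (RAM to the EDU buffer) and dma_command(True) yields 3 (EDU buffer back to RAM), the two values written in the example. In both cases the driver then polls until bit 0x01 reads back as clear.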
diff --git a/docs/specs/fsi.rst b/docs/specs/fsi.rst
new file mode 100644
index 0000000000..af87822531
--- /dev/null
+++ b/docs/specs/fsi.rst
@@ -0,0 +1,122 @@
+======================================
+IBM's Flexible Service Interface (FSI)
+======================================
+
+The QEMU FSI emulation implements hardware interfaces between the ASPEED SoC,
+the FSI master/slave and the end engine.
+
+FSI is a point-to-point two-wire interface which is capable of supporting
+distances of up to 4 meters. FSI interfaces have been used successfully for
+many years in IBM servers to attach IBM Flexible Support Processors (FSP) to
+CPUs and IBM ASICs.
+
+FSI allows a service processor access to the internal buses of a host POWER
+processor to perform configuration or debugging. FSI has long existed in POWER
+processors and so comes with some baggage, including how it has been integrated
+into the ASPEED SoC.
+
+Working backwards from the POWER processor, the fundamental pieces of interest
+for the implementation are: (see the `FSI specification`_ for more details)
+
+1. The Common FRU Access Macro (CFAM), an address space containing various
+ "engines" that drive accesses on buses internal and external to the POWER
+ chip. Examples include the SBEFIFO and I2C masters. The engines hang off of
+ an internal Local Bus (LBUS) which is described by the CFAM configuration
+ block.
+
+2. The FSI slave: The slave is the terminal point of the FSI bus for FSI
+ symbols addressed to it. Slaves can be cascaded off of one another. The
+ slave's configuration registers appear in the address space of the CFAM to
+ which it is attached.
+
+3. The FSI master: A controller in the platform service processor (e.g. BMC)
+ driving CFAM engine accesses into the POWER chip. At the hardware level
+ FSI is a bit-based protocol supporting synchronous and DMA-driven accesses
+ of engines in a CFAM.
+
+4. The On-Chip Peripheral Bus (OPB): A low-speed bus typically found in POWER
+ processors. This now makes an appearance in the ASPEED SoC due to tight
+ integration of the FSI master IP with the OPB, mainly the existence of an
+ MMIO-mapping of the CFAM address straight onto a sub-region of the OPB
+ address space.
+
+5. An APB-to-OPB bridge enabling access to the OPB from the ARM core in the
+ AST2600. Hardware limitations prevent the OPB from being directly mapped
+ into APB, so all accesses are indirect through the bridge.
+
+The LBUS is modelled to maintain the qdev bus hierarchy and to take advantage
+of the object model to automatically generate the CFAM configuration block.
+The configuration block presents engines in the order they are attached to the
+CFAM's LBUS. Engine implementations should subclass the LBusDevice and set the
+'config' member of LBusDeviceClass to match the engine's type.
+
+CFAM designs offer a lot of flexibility; for instance, it is possible for a
+CFAM to be simultaneously driven from multiple FSI links. The modeling is not
+so complete; it's assumed that each CFAM is attached to a single FSI slave (as
+a consequence the CFAM subclasses the FSI slave).
+
+As for FSI, its symbols and wire-protocol are not modelled at all. This is not
+necessary to get FSI off the ground thanks to the mapping of the CFAM address
+space onto the OPB address space - the models follow this directly and map the
+CFAM memory region into the OPB's memory region.
+
+The following command starts the ``rainier-bmc`` machine with the built-in FSI
+model. There are no model-specific arguments. To learn more about the Aspeed
+``rainier-bmc`` machine, see :doc:`../../system/arm/aspeed`.
+
+.. code-block:: console
+
+ qemu-system-arm -M rainier-bmc -nographic \
+ -kernel fitImage-linux.bin \
+ -dtb aspeed-bmc-ibm-rainier.dtb \
+ -initrd obmc-phosphor-initramfs.rootfs.cpio.xz \
+ -drive file=obmc-phosphor-image.rootfs.wic.qcow2,if=sd,index=2 \
+ -append "rootwait console=ttyS4,115200n8 root=PARTLABEL=rofs-a"
+
+The implementation appears as follows in the QEMU device tree:
+
+.. code-block:: console
+
+ (qemu) info qtree
+ bus: main-system-bus
+ type System
+ ...
+ dev: aspeed.apb2opb, id ""
+ gpio-out "sysbus-irq" 1
+ mmio 000000001e79b000/0000000000001000
+ bus: opb.1
+ type opb
+ dev: fsi.master, id ""
+ bus: fsi.bus.1
+ type fsi.bus
+ dev: cfam.config, id ""
+ dev: cfam, id ""
+ bus: lbus.1
+ type lbus
+ dev: scratchpad, id ""
+ address = 0 (0x0)
+ bus: opb.0
+ type opb
+ dev: fsi.master, id ""
+ bus: fsi.bus.0
+ type fsi.bus
+ dev: cfam.config, id ""
+ dev: cfam, id ""
+ bus: lbus.0
+ type lbus
+ dev: scratchpad, id ""
+ address = 0 (0x0)
+
+pdbg is a simple application that allows debugging of the host POWER processors
+from the BMC (see the `pdbg source repository`_ for more details).
+
+.. code-block:: console
+
+ root@p10bmc:~# pdbg -a getcfam 0x0
+ p0: 0x0 = 0xc0022d15
+
+.. _FSI specification:
+ https://openpowerfoundation.org/specifications/fsi/
+
+.. _pdbg source repository:
+ https://github.com/open-power/pdbg
diff --git a/docs/specs/fw_cfg.txt b/docs/specs/fw_cfg.rst
index 3e6d586f66..5ad47a901c 100644
--- a/docs/specs/fw_cfg.txt
+++ b/docs/specs/fw_cfg.rst
@@ -1,7 +1,9 @@
+===========================================
QEMU Firmware Configuration (fw_cfg) Device
===========================================
-= Guest-side Hardware Interface =
+Guest-side Hardware Interface
+=============================
This hardware interface allows the guest to retrieve various data items
(blobs) that can influence how the firmware configures itself, or may
@@ -9,7 +11,8 @@ contain tables to be installed for the guest OS. Examples include device
boot order, ACPI and SMBIOS tables, virtual machine UUID, SMP and NUMA
information, kernel/initrd images for direct (Linux) kernel booting, etc.
-== Selector (Control) Register ==
+Selector (Control) Register
+---------------------------
* Write only
* Location: platform dependent (IOport or MMIO)
@@ -30,10 +33,12 @@ of 1 means the item's data can be overwritten by writes to the data
register. In other words, configuration write mode is enabled when
the selector value is between 0x4000-0x7fff or 0xc000-0xffff.
-NOTE: As of QEMU v2.4, writes to the fw_cfg data register are no
+.. NOTE::
+ As of QEMU v2.4, writes to the fw_cfg data register are no
longer supported, and will be ignored (treated as no-ops)!
-NOTE: As of QEMU v2.9, writes are reinstated, but only through the DMA
+.. NOTE::
+ As of QEMU v2.9, writes are reinstated, but only through the DMA
interface (see below). Furthermore, writeability of any specific item is
governed independently of Bit14 in the selector key value.
@@ -45,17 +50,19 @@ items are accessed with a selector value between 0x0000-0x7fff, and
architecture specific configuration items are accessed with a selector
value between 0x8000-0xffff.
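
As a rough illustration of the two selector bits described above, the
following sketch decodes a selector key. The macro and function names are
assumptions made for this example; only the bit positions and ranges are
taken from the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names; only the bit positions come from the text above. */
#define SEL_WRITE_MODE_BIT 0x4000u /* bit 14: configuration write mode */
#define SEL_ARCH_BIT       0x8000u /* bit 15: architecture-specific item */
#define SEL_INDEX_MASK     0x3fffu /* item index within its range */

/* True for keys in 0x4000-0x7fff or 0xc000-0xffff. */
static bool sel_is_write_mode(uint16_t key)
{
    return (key & SEL_WRITE_MODE_BIT) != 0;
}

/* True for keys in 0x8000-0xffff. */
static bool sel_is_arch_specific(uint16_t key)
{
    return (key & SEL_ARCH_BIT) != 0;
}

/* Strip both flag bits, leaving the 14-bit item index. */
static uint16_t sel_index(uint16_t key)
{
    return (uint16_t)(key & SEL_INDEX_MASK);
}
```

Note that, per the notes above, the write-mode bit is ignored by QEMU v2.4
and later; the helper only classifies the key value.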
-== Data Register ==
+Data Register
+-------------
* Read/Write (writes ignored as of QEMU v2.4, but see the DMA interface)
-* Location: platform dependent (IOport [*] or MMIO)
+* Location: platform dependent (IOport [#]_ or MMIO)
* Width: 8-bit (if IOport), 8/16/32/64-bit (if MMIO)
* Endianness: string-preserving
-[*] On platforms where the data register is exposed as an IOport, its
-port number will always be one greater than the port number of the
-selector register. In other words, the two ports overlap, and can not
-be mapped separately.
+.. [#]
+ On platforms where the data register is exposed as an IOport, its
+ port number will always be one greater than the port number of the
+ selector register. In other words, the two ports overlap, and can not
+ be mapped separately.
The data register allows access to an array of bytes for each firmware
configuration data item. The specific item is selected by writing to
@@ -74,91 +81,103 @@ An N-byte wide read of the data register will return the next available
N bytes of the selected firmware configuration item, as a substring, in
increasing address order, similar to memcpy().
-== Register Locations ==
-
-=== x86, x86_64 Register Locations ===
+Register Locations
+------------------
-Selector Register IOport: 0x510
-Data Register IOport: 0x511
-DMA Address IOport: 0x514
+x86, x86_64
+ * Selector Register IOport: 0x510
+ * Data Register IOport: 0x511
+ * DMA Address IOport: 0x514
-=== Arm Register Locations ===
+Arm
+ * Selector Register address: Base + 8 (2 bytes)
+ * Data Register address: Base + 0 (8 bytes)
+ * DMA Address address: Base + 16 (8 bytes)
-Selector Register address: Base + 8 (2 bytes)
-Data Register address: Base + 0 (8 bytes)
-DMA Address address: Base + 16 (8 bytes)
+ACPI Interface
+--------------
-== ACPI Interface ==
-
-The fw_cfg device is defined with ACPI ID "QEMU0002". Since we expect
+The fw_cfg device is defined with ACPI ID ``QEMU0002``. Since we expect
ACPI tables to be passed into the guest through the fw_cfg device itself,
the guest-side firmware can not use ACPI to find fw_cfg. However, once the
firmware is finished setting up ACPI tables and hands control over to the
guest kernel, the latter can use the fw_cfg ACPI node for a more accurate
inventory of in-use IOport or MMIO regions.
-== Firmware Configuration Items ==
+Firmware Configuration Items
+----------------------------
-=== Signature (Key 0x0000, FW_CFG_SIGNATURE) ===
+Signature (Key 0x0000, ``FW_CFG_SIGNATURE``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The presence of the fw_cfg selector and data registers can be verified
-by selecting the "signature" item using key 0x0000 (FW_CFG_SIGNATURE),
+by selecting the "signature" item using key 0x0000 (``FW_CFG_SIGNATURE``),
and reading four bytes from the data register. If the fw_cfg device is
-present, the four bytes read will contain the characters "QEMU".
+present, the four bytes read will contain the characters ``QEMU``.
If the DMA interface is available, then reading the DMA Address
-Register returns 0x51454d5520434647 ("QEMU CFG" in big-endian format).
+Register returns 0x51454d5520434647 (``QEMU CFG`` in big-endian format).
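
The two signature values can be cross-checked with a small sketch such as
the one here. The helper names are hypothetical; how the bytes are actually
read from the device (IOport or MMIO) is platform dependent and deliberately
left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Compare four bytes read from the data register against "QEMU". */
static bool is_fw_cfg_signature(const uint8_t bytes[4])
{
    return memcmp(bytes, "QEMU", 4) == 0;
}

/*
 * Pack eight characters into a 64-bit value, first character in the
 * most significant byte (i.e. big-endian order).
 */
static uint64_t pack_be64(const char s[8])
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++) {
        v = (v << 8) | (uint8_t)s[i];
    }
    return v;
}
```

Packing ``QEMU CFG`` this way yields the DMA signature value quoted above.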
-=== Revision / feature bitmap (Key 0x0001, FW_CFG_ID) ===
+Revision / feature bitmap (Key 0x0001, ``FW_CFG_ID``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A 32-bit little-endian unsigned int, this item is used to check for enabled
features.
- - Bit 0: traditional interface. Always set.
- - Bit 1: DMA interface.
-=== File Directory (Key 0x0019, FW_CFG_FILE_DIR) ===
+- Bit 0: traditional interface. Always set.
+- Bit 1: DMA interface.
+
+File Directory (Key 0x0019, ``FW_CFG_FILE_DIR``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. highlight:: c
Firmware configuration items stored at selector keys 0x0020 or higher
-(FW_CFG_FILE_FIRST or higher) have an associated entry in a directory
+(``FW_CFG_FILE_FIRST`` or higher) have an associated entry in a directory
structure, which makes it easier for guest-side firmware to identify
-and retrieve them. The format of this file directory (from fw_cfg.h in
-the QEMU source tree) is shown here, slightly annotated for clarity:
+and retrieve them. The format of this file directory (from ``fw_cfg.h`` in
+the QEMU source tree) is shown here, slightly annotated for clarity::
-struct FWCfgFiles { /* the entire file directory fw_cfg item */
- uint32_t count; /* number of entries, in big-endian format */
- struct FWCfgFile f[]; /* array of file entries, see below */
-};
+ struct FWCfgFiles { /* the entire file directory fw_cfg item */
+ uint32_t count; /* number of entries, in big-endian format */
+ struct FWCfgFile f[]; /* array of file entries, see below */
+ };
-struct FWCfgFile { /* an individual file entry, 64 bytes total */
- uint32_t size; /* size of referenced fw_cfg item, big-endian */
- uint16_t select; /* selector key of fw_cfg item, big-endian */
- uint16_t reserved;
- char name[56]; /* fw_cfg item name, NUL-terminated ascii */
-};
+ struct FWCfgFile { /* an individual file entry, 64 bytes total */
+ uint32_t size; /* size of referenced fw_cfg item, big-endian */
+ uint16_t select; /* selector key of fw_cfg item, big-endian */
+ uint16_t reserved;
+ char name[56]; /* fw_cfg item name, NUL-terminated ascii */
+ };
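
A guest that has copied the whole directory item into memory might look up
an entry roughly as follows. This is a minimal sketch under stated
assumptions: ``fw_cfg_dir_find`` and the load helpers are hypothetical names,
and the buffer is assumed to hold the raw big-endian bytes exactly as read
from the device.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIR_ENTRY_SIZE 64u /* 4 (size) + 2 (select) + 2 (reserved) + 56 (name) */
#define DIR_NAME_SIZE  56u

static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint16_t load_be16(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/*
 * Search a raw copy of the file directory for `name`.
 * Returns 0 and fills *select and *size on success, -1 otherwise.
 */
static int fw_cfg_dir_find(const uint8_t *dir, size_t dir_len,
                           const char *name,
                           uint16_t *select, uint32_t *size)
{
    if (dir_len < 4) {
        return -1;
    }

    uint32_t count = load_be32(dir);
    size_t avail = (dir_len - 4) / DIR_ENTRY_SIZE;

    /* Never walk past the end of the buffer we were given. */
    if (count > avail) {
        count = (uint32_t)avail;
    }

    const uint8_t *entry = dir + 4;
    for (uint32_t i = 0; i < count; i++, entry += DIR_ENTRY_SIZE) {
        if (strncmp((const char *)(entry + 8), name, DIR_NAME_SIZE) == 0) {
            *size = load_be32(entry);
            *select = load_be16(entry + 4);
            return 0;
        }
    }
    return -1;
}
```

Decoding fields byte by byte avoids depending on the host's endianness or on
struct padding.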
-=== All Other Data Items ===
+All Other Data Items
+~~~~~~~~~~~~~~~~~~~~
Please consult the QEMU source for the most up-to-date and authoritative list
of selector keys and their respective items' purpose, format and writeability.
-=== Ranges ===
+Ranges
+~~~~~~
Theoretically, there may be up to 0x4000 generic firmware configuration
items, and up to 0x4000 architecturally specific ones.
+=============== ===========
Selector Reg. Range Usage
---------------- -----------
+=============== ===========
0x0000 - 0x3fff Generic (0x0000 - 0x3fff, generally RO, possibly RW through
- the DMA interface in QEMU v2.9+)
+ the DMA interface in QEMU v2.9+)
0x4000 - 0x7fff Generic (0x0000 - 0x3fff, RW, ignored in QEMU v2.4+)
0x8000 - 0xbfff Arch. Specific (0x0000 - 0x3fff, generally RO, possibly RW
- through the DMA interface in QEMU v2.9+)
+ through the DMA interface in QEMU v2.9+)
0xc000 - 0xffff Arch. Specific (0x0000 - 0x3fff, RW, ignored in v2.4+)
+=============== ===========
In practice, the number of allowed firmware configuration items depends on the
machine type/version.
-= Guest-side DMA Interface =
+Guest-side DMA Interface
+========================
If bit 1 of the feature bitmap is set, the DMA interface is present. This does
not replace the existing fw_cfg interface, it is an add-on. This interface
@@ -171,68 +190,74 @@ addresses can be triggered with just one write, whereas operations with
64-bit addresses can be triggered with one 64-bit write or two 32-bit writes,
starting with the most significant half (at offset 0).
-In this register, the physical address of a FWCfgDmaAccess structure in RAM
-should be written. This is the format of the FWCfgDmaAccess structure:
+In this register, the physical address of a ``FWCfgDmaAccess`` structure in RAM
+should be written. This is the format of the ``FWCfgDmaAccess`` structure::
-typedef struct FWCfgDmaAccess {
- uint32_t control;
- uint32_t length;
- uint64_t address;
-} FWCfgDmaAccess;
+ typedef struct FWCfgDmaAccess {
+ uint32_t control;
+ uint32_t length;
+ uint64_t address;
+ } FWCfgDmaAccess;
The fields of the structure are in big endian mode, and the field at the lowest
-address is the "control" field.
+address is the ``control`` field.
+
+The ``control`` field has the following bits:
-The "control" field has the following bits:
- - Bit 0: Error
- - Bit 1: Read
- - Bit 2: Skip
- - Bit 3: Select. The upper 16 bits are the selected index.
- - Bit 4: Write
+- Bit 0: Error
+- Bit 1: Read
+- Bit 2: Skip
+- Bit 3: Select. The upper 16 bits are the selected index.
+- Bit 4: Write
-When an operation is triggered, if the "control" field has bit 3 set, the
+When an operation is triggered, if the ``control`` field has bit 3 set, the
upper 16 bits are interpreted as an index of a firmware configuration item.
This has the same effect as writing the selector register.
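
To make the layout of the ``control`` field concrete, here is a hedged sketch
that builds a control word selecting an item and stores it in big-endian
byte order. The enum and function names are illustrative assumptions; only
the bit positions come from the list above.

```c
#include <stdint.h>

/* Bit values from the list above; the names are illustrative. */
enum {
    DMA_CTL_ERROR  = 1u << 0,
    DMA_CTL_READ   = 1u << 1,
    DMA_CTL_SKIP   = 1u << 2,
    DMA_CTL_SELECT = 1u << 3,
    DMA_CTL_WRITE  = 1u << 4,
};

/* Combine a selector key (upper 16 bits) with the select bit and an operation. */
static uint32_t dma_control_select(uint16_t key, uint32_t op_bits)
{
    return ((uint32_t)key << 16) | DMA_CTL_SELECT | op_bits;
}

/* Store a 32-bit value in big-endian byte order, as the structure requires. */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

For example, selecting key 0x0019 together with the read bit gives the
control word 0x0019000a.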
-If the "control" field has bit 1 set, a read operation will be performed.
-"length" bytes for the current selector and offset will be copied into the
-physical RAM address specified by the "address" field.
+If the ``control`` field has bit 1 set, a read operation will be performed.
+``length`` bytes for the current selector and offset will be copied into the
+physical RAM address specified by the ``address`` field.
-If the "control" field has bit 4 set (and not bit 1), a write operation will be
-performed. "length" bytes will be copied from the physical RAM address
-specified by the "address" field to the current selector and offset. QEMU
+If the ``control`` field has bit 4 set (and not bit 1), a write operation will be
+performed. ``length`` bytes will be copied from the physical RAM address
+specified by the ``address`` field to the current selector and offset. QEMU
prevents starting or finishing the write beyond the end of the item associated
with the current selector (i.e., the item cannot be resized). Truncated writes
are dropped entirely. Writes to read-only items are also rejected. All of these
-write errors set bit 0 (the error bit) in the "control" field.
+write errors set bit 0 (the error bit) in the ``control`` field.
-If the "control" field has bit 2 set (and neither bit 1 nor bit 4), a skip
+If the ``control`` field has bit 2 set (and neither bit 1 nor bit 4), a skip
operation will be performed. The offset for the current selector will be
-advanced "length" bytes.
+advanced ``length`` bytes.
+
+To check the result, read the ``control`` field:
-To check the result, read the "control" field:
- error bit set -> something went wrong.
- all bits cleared -> transfer finished successfully.
- otherwise -> transfer still in progress (doesn't happen
- today due to implementation not being async,
- but may in the future).
+Error bit set
+ Something went wrong.
+All bits cleared
+ Transfer finished successfully.
+Otherwise
+ Transfer still in progress
+ (doesn't happen today due to implementation not being async,
+ but may in the future).
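
The three outcomes listed above map onto a small classification helper. This
is only a sketch: the names are hypothetical, and the ``control`` value is
assumed to have been read back from RAM and already converted from
big-endian to host byte order.

```c
#include <stdint.h>

enum dma_result {
    DMA_RESULT_DONE,    /* all bits cleared */
    DMA_RESULT_FAILED,  /* error bit (bit 0) set */
    DMA_RESULT_PENDING, /* anything else: transfer still in progress */
};

/* Classify a host-order control value according to the rules above. */
static enum dma_result dma_check(uint32_t control)
{
    if (control & 1u) {
        return DMA_RESULT_FAILED;
    }
    if (control == 0) {
        return DMA_RESULT_DONE;
    }
    return DMA_RESULT_PENDING;
}
```

A guest would typically call such a helper in a loop until the result is no
longer pending, even though today's synchronous implementation never reports
that state.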
-= Externally Provided Items =
+Externally Provided Items
+=========================
Since v2.4, "file" fw_cfg items (i.e., items with selector keys above
-FW_CFG_FILE_FIRST, and with a corresponding entry in the fw_cfg file
+``FW_CFG_FILE_FIRST``, and with a corresponding entry in the fw_cfg file
directory structure) may be inserted via the QEMU command line, using
-the following syntax:
+the following syntax::
-fw_cfg [name=]<item_name>,file=<path>
-Or
+Or::
-fw_cfg [name=]<item_name>,string=<string>
Since v5.1, QEMU allows some objects to generate fw_cfg-specific content,
the content is then associated with a "file" item using the 'gen_id' option
-in the command line, using the following syntax:
+in the command line, using the following syntax::
-object <generator-type>,id=<generated_id>,[generator-specific-options] \
-fw_cfg [name=]<item_name>,gen_id=<generated_id>
@@ -241,24 +266,24 @@ See QEMU man page for more documentation.
Using item_name with plain ASCII characters only is recommended.
-Item names beginning with "opt/" are reserved for users. QEMU will
+Item names beginning with ``opt/`` are reserved for users. QEMU will
never create entries with such names unless explicitly ordered by the
user.
To avoid clashes among different users, it is strongly recommended
-that you use names beginning with opt/RFQDN/, where RFQDN is a reverse
+that you use names beginning with ``opt/RFQDN/``, where RFQDN is a reverse
fully qualified domain name you control. For instance, if SeaBIOS
-wanted to define additional names, the prefix "opt/org.seabios/" would
+wanted to define additional names, the prefix ``opt/org.seabios/`` would
be appropriate.
-For historical reasons, "opt/ovmf/" is reserved for OVMF firmware.
+For historical reasons, ``opt/ovmf/`` is reserved for OVMF firmware.
-Prefix "opt/org.qemu/" is reserved for QEMU itself.
+Prefix ``opt/org.qemu/`` is reserved for QEMU itself.
-Use of names not beginning with "opt/" is potentially dangerous and
+Use of names not beginning with ``opt/`` is potentially dangerous and
entirely unsupported. QEMU will warn if you try.
-Use of names not beginning with "opt/" is tolerated with 'gen_id' (that
+Use of names not beginning with ``opt/`` is tolerated with 'gen_id' (that
is, the warning is suppressed), but you must know exactly what you're
doing.
diff --git a/docs/specs/index.rst b/docs/specs/index.rst
index ecc43896bb..1484e3e760 100644
--- a/docs/specs/index.rst
+++ b/docs/specs/index.rst
@@ -8,6 +8,9 @@ guest hardware that is specific to QEMU.
.. toctree::
:maxdepth: 2
+ pci-ids
+ pci-serial
+ pci-testdev
ppc-xive
ppc-spapr-xive
ppc-spapr-numa
@@ -18,3 +21,15 @@ guest hardware that is specific to QEMU.
acpi_mem_hotplug
acpi_pci_hotplug
acpi_nvdimm
+ acpi_erst
+ sev-guest-firmware
+ fw_cfg
+ fsi
+ vmw_pvscsi-spec
+ edu
+ ivshmem-spec
+ pvpanic
+ standard-vga
+ virt-ctlr
+ vmcoreinfo
+ vmgenid
diff --git a/docs/specs/ivshmem-spec.txt b/docs/specs/ivshmem-spec.rst
index 1beb3a01ec..2d8e80055b 100644
--- a/docs/specs/ivshmem-spec.txt
+++ b/docs/specs/ivshmem-spec.rst
@@ -1,4 +1,6 @@
-= Device Specification for Inter-VM shared memory device =
+======================================================
+Device Specification for Inter-VM shared memory device
+======================================================
The Inter-VM shared memory device (ivshmem) is designed to share a
memory region between multiple QEMU processes running different guests
@@ -12,42 +14,17 @@ can obtain one from an ivshmem server.
In the latter case, the device can additionally interrupt its peers, and
get interrupted by its peers.
+For information on configuring the ivshmem device on the QEMU
+command line, see :doc:`../system/devices/ivshmem`.
-== Configuring the ivshmem PCI device ==
-
-There are two basic configurations:
-
-- Just shared memory:
-
- -device ivshmem-plain,memdev=HMB,...
-
- This uses host memory backend HMB. It should have option "share"
- set.
-
-- Shared memory plus interrupts:
-
- -device ivshmem-doorbell,chardev=CHR,vectors=N,...
-
- An ivshmem server must already be running on the host. The device
- connects to the server's UNIX domain socket via character device
- CHR.
-
- Each peer gets assigned a unique ID by the server. IDs must be
- between 0 and 65535.
-
- Interrupts are message-signaled (MSI-X). vectors=N configures the
- number of vectors to use.
-
-For more details on ivshmem device properties, see the QEMU Emulator
-user documentation.
-
-
-== The ivshmem PCI device's guest interface ==
+The ivshmem PCI device's guest interface
+========================================
The device has vendor ID 1af4, device ID 1110, revision 1. Before
QEMU 2.6.0, it had revision 0.
-=== PCI BARs ===
+PCI BARs
+--------
The ivshmem PCI device has two or three BARs:
@@ -59,8 +36,7 @@ There are two ways to use this device:
- If you only need the shared memory part, BAR2 suffices. This way,
you have access to the shared memory in the guest and can use it as
- you see fit. Memnic, for example, uses ivshmem this way from guest
- user space (see http://dpdk.org/browse/memnic).
+ you see fit.
- If you additionally need the capability for peers to interrupt each
other, you need BAR0 and BAR1. You will most likely want to write a
@@ -77,10 +53,13 @@ accessing BAR2.
Revision 0 of the device is not capable to tell guest software whether
it is configured for interrupts.
-=== PCI device registers ===
+PCI device registers
+--------------------
BAR 0 contains the following registers:
+::
+
Offset Size Access On reset Function
0 4 read/write 0 Interrupt Mask
bit 0: peer interrupt (rev 0)
@@ -145,18 +124,20 @@ With multiple MSI-X vectors, different vectors can be used to indicate
different events have occurred. The semantics of interrupt vectors
are left to the application.
-
-== Interrupt infrastructure ==
+Interrupt infrastructure
+========================
When configured for interrupts, the peers share eventfd objects in
addition to shared memory. The shared resources are managed by an
ivshmem server.
-=== The ivshmem server ===
+The ivshmem server
+------------------
The server listens on a UNIX domain socket.
For each new client that connects to the server, the server
+
- picks an ID,
- creates eventfd file descriptors for the interrupt vectors,
- sends the ID and the file descriptor for the shared memory to the
@@ -189,7 +170,8 @@ vectors.
A standalone client is in contrib/ivshmem-client/. It can be useful
for debugging.
-=== The ivshmem Client-Server Protocol ===
+The ivshmem Client-Server Protocol
+----------------------------------
An ivshmem device configured for interrupts connects to an ivshmem
server. This section details the protocol between the two.
@@ -245,7 +227,8 @@ Known bugs:
* The protocol is poorly designed.
-=== The ivshmem Client-Client Protocol ===
+The ivshmem Client-Client Protocol
+----------------------------------
An ivshmem device configured for interrupts receives eventfd file
descriptors for interrupting peers and getting interrupted by peers
diff --git a/docs/specs/pci-ids.rst b/docs/specs/pci-ids.rst
new file mode 100644
index 0000000000..c0a3dec2e7
--- /dev/null
+++ b/docs/specs/pci-ids.rst
@@ -0,0 +1,100 @@
+================
+PCI IDs for QEMU
+================
+
+Red Hat, Inc. donates a part of its device ID range to QEMU, to be used for
+virtual devices. The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36.
+
+Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned
+for your devices.
+
+1af4 vendor ID
+--------------
+
+The 1000 -> 10ff device ID range is used as follows for virtio-pci devices.
+Note that this allocation is separate from the virtio device IDs, which are
+maintained as part of the virtio specification.
+
+1af4:1000
+ network device (legacy)
+1af4:1001
+ block device (legacy)
+1af4:1002
+ balloon device (legacy)
+1af4:1003
+ console device (legacy)
+1af4:1004
+ SCSI host bus adapter device (legacy)
+1af4:1005
+ entropy generator device (legacy)
+1af4:1009
+ 9p filesystem device (legacy)
+1af4:1012
+ vsock device (bug compatibility)
+
+1af4:1040 to 1af4:10ef
+ ID range for modern virtio devices. The PCI device
+ ID is calculated from the virtio device ID by adding the
+ 0x1040 offset. The virtio IDs are defined in the virtio
+ specification. The Linux kernel has a header file with
+ defines for all virtio IDs (``linux/virtio_ids.h``); QEMU has a
+ copy in ``include/standard-headers/``.
+
+1af4:10f0 to 1af4:10ff
+ Available for experimental usage without registration. Must get an
+ official ID when the code leaves the test lab (i.e. when seeking
+ upstream merge or shipping a distro/product) to avoid conflicts.
+
+1af4:1100
+ Used as PCI Subsystem ID for existing hardware devices emulated
+ by QEMU.
+
+1af4:1110
+ ivshmem device (:doc:`ivshmem-spec`)
+
+All other device IDs are reserved.
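
The offset rule for modern virtio devices in the 1af4 range can be written
as a short helper. The names here are assumptions made for illustration; the
base 0x1040 and the upper bound 0x10ef come from the range given above.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_PCI_MODERN_BASE 0x1040u /* offset added to the virtio ID */
#define VIRTIO_PCI_MODERN_LAST 0x10efu /* end of the modern ID range */

/*
 * Compute the PCI device ID for a modern virtio device.
 * Returns false if the result would fall outside 1040..10ef.
 */
static bool virtio_modern_pci_id(uint16_t virtio_id, uint16_t *pci_id)
{
    uint32_t id = VIRTIO_PCI_MODERN_BASE + (uint32_t)virtio_id;

    if (id > VIRTIO_PCI_MODERN_LAST) {
        return false;
    }
    *pci_id = (uint16_t)id;
    return true;
}
```

For instance, virtio device ID 1 (network) maps to 1af4:1041 and ID 16 (GPU)
maps to 1af4:1050.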
+
+1b36 vendor ID
+--------------
+
+The 0000 -> 00ff device ID range is used as follows for QEMU-specific
+PCI devices (other than virtio):
+
+1b36:0001
+ PCI-PCI bridge
+1b36:0002
+ PCI serial port (16550A) adapter (:doc:`pci-serial`)
+1b36:0003
+ PCI Dual-port 16550A adapter (:doc:`pci-serial`)
+1b36:0004
+ PCI Quad-port 16550A adapter (:doc:`pci-serial`)
+1b36:0005
+ PCI test device (:doc:`pci-testdev`)
+1b36:0006
+ PCI Rocker Ethernet switch device
+1b36:0007
+ PCI SD Card Host Controller Interface (SDHCI)
+1b36:0008
+ PCIe host bridge
+1b36:0009
+ PCI Expander Bridge (-device pxb)
+1b36:000a
+ PCI-PCI bridge (multiseat)
+1b36:000b
+ PCIe Expander Bridge (-device pxb-pcie)
+1b36:000d
+ PCI xhci usb host adapter
+1b36:000f
+ mdpy (mdev sample device), ``linux/samples/vfio-mdev/mdpy.c``
+1b36:0010
+ PCIe NVMe device (``-device nvme``)
+1b36:0011
+ PCI PVPanic device (``-device pvpanic-pci``)
+1b36:0012
+ PCI ACPI ERST device (``-device acpi-erst``)
+1b36:0013
+ PCI UFS device (``-device ufs``)
+
+All these devices are documented in :doc:`index`.
+
+The 0100 device ID is used for the QXL video card device.
diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt
deleted file mode 100644
index 5e407a6f32..0000000000
--- a/docs/specs/pci-ids.txt
+++ /dev/null
@@ -1,71 +0,0 @@
-
-PCI IDs for qemu
-================
-
-Red Hat, Inc. donates a part of its device ID range to qemu, to be used for
-virtual devices. The vendor IDs are 1af4 (formerly Qumranet ID) and 1b36.
-
-Contact Gerd Hoffmann <kraxel@redhat.com> to get a device ID assigned
-for your devices.
-
-1af4 vendor ID
---------------
-
-The 1000 -> 10ff device ID range is used as follows for virtio-pci devices.
-Note that this allocation separate from the virtio device IDs, which are
-maintained as part of the virtio specification.
-
-1af4:1000 network device (legacy)
-1af4:1001 block device (legacy)
-1af4:1002 balloon device (legacy)
-1af4:1003 console device (legacy)
-1af4:1004 SCSI host bus adapter device (legacy)
-1af4:1005 entropy generator device (legacy)
-1af4:1009 9p filesystem device (legacy)
-
-1af4:1041 network device (modern)
-1af4:1042 block device (modern)
-1af4:1043 console device (modern)
-1af4:1044 entropy generator device (modern)
-1af4:1045 balloon device (modern)
-1af4:1048 SCSI host bus adapter device (modern)
-1af4:1049 9p filesystem device (modern)
-1af4:1050 virtio gpu device (modern)
-1af4:1052 virtio input device (modern)
-
-1af4:10f0 Available for experimental usage without registration. Must get
- to official ID when the code leaves the test lab (i.e. when seeking
-1af4:10ff upstream merge or shipping a distro/product) to avoid conflicts.
-
-1af4:1100 Used as PCI Subsystem ID for existing hardware devices emulated
- by qemu.
-
-1af4:1110 ivshmem device (shared memory, docs/specs/ivshmem-spec.txt)
-
-All other device IDs are reserved.
-
-1b36 vendor ID
---------------
-
-The 0000 -> 00ff device ID range is used as follows for QEMU-specific
-PCI devices (other than virtio):
-
-1b36:0001 PCI-PCI bridge
-1b36:0002 PCI serial port (16550A) adapter (docs/specs/pci-serial.txt)
-1b36:0003 PCI Dual-port 16550A adapter (docs/specs/pci-serial.txt)
-1b36:0004 PCI Quad-port 16550A adapter (docs/specs/pci-serial.txt)
-1b36:0005 PCI test device (docs/specs/pci-testdev.txt)
-1b36:0006 PCI Rocker Ethernet switch device
-1b36:0007 PCI SD Card Host Controller Interface (SDHCI)
-1b36:0008 PCIe host bridge
-1b36:0009 PCI Expander Bridge (-device pxb)
-1b36:000a PCI-PCI bridge (multiseat)
-1b36:000b PCIe Expander Bridge (-device pxb-pcie)
-1b36:000d PCI xhci usb host adapter
-1b36:000f mdpy (mdev sample device), linux/samples/vfio-mdev/mdpy.c
-1b36:0010 PCIe NVMe device (-device nvme)
-1b36:0011 PCI PVPanic device (-device pvpanic-pci)
-
-All these devices are documented in docs/specs.
-
-The 0100 device ID is used for the QXL video card device.
diff --git a/docs/specs/pci-serial.rst b/docs/specs/pci-serial.rst
new file mode 100644
index 0000000000..8d916a3669
--- /dev/null
+++ b/docs/specs/pci-serial.rst
@@ -0,0 +1,37 @@
+=======================
+QEMU PCI serial devices
+=======================
+
+QEMU implements some PCI serial devices which are simple PCI
+wrappers around one or more 16550 UARTs.
+
+There is one single-port variant and two multiport-variants. Linux
+guests work out of the box with all cards. There is a Windows inf file
+(``docs/qemupciserial.inf``) to set up the cards in Windows guests.
+
+
+Single-port card
+----------------
+
+Name:
+ ``pci-serial``
+PCI ID:
+ 1b36:0002
+PCI Region 0:
+ IO bar, 8 bytes long, with the 16550 UART mapped to it.
+Interrupt:
+ Wired to pin A.
+
+
+Multiport cards
+---------------
+
+Name:
+ ``pci-serial-2x``, ``pci-serial-4x``
+PCI ID:
+ 1b36:0003 (``-2x``) and 1b36:0004 (``-4x``)
+PCI Region 0:
+ IO bar, with two or four 16550 UARTs mapped after each other.
+ The first is at offset 0, the second at offset 8, and so on.
+Interrupt:
+ Wired to pin A.
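
As a small worked example of the multiport layout, the offset of each UART
inside PCI region 0 follows from the 8-byte spacing described above. The
function name and error convention are hypothetical.

```c
#include <stdint.h>

#define PCI_SERIAL_UART_SPAN 8u /* each 16550 UART occupies 8 bytes */

/*
 * Offset of UART `index` on a card with `ports` UARTs (1, 2 or 4).
 * Returns 0 on success, -1 if the index is out of range.
 */
static int pci_serial_uart_offset(unsigned ports, unsigned index,
                                  uint32_t *offset)
{
    if (index >= ports) {
        return -1;
    }
    *offset = index * PCI_SERIAL_UART_SPAN;
    return 0;
}
```

On a ``pci-serial-4x`` card this places the fourth UART at offset 24.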
diff --git a/docs/specs/pci-serial.txt b/docs/specs/pci-serial.txt
deleted file mode 100644
index 66c761f2b4..0000000000
--- a/docs/specs/pci-serial.txt
+++ /dev/null
@@ -1,34 +0,0 @@
-
-QEMU pci serial devices
-=======================
-
-There is one single-port variant and two muliport-variants. Linux
-guests out-of-the box with all cards. There is a Windows inf file
-(docs/qemupciserial.inf) to setup the single-port card in Windows
-guests.
-
-
-single-port card
-----------------
-
-Name: pci-serial
-PCI ID: 1b36:0002
-
-PCI Region 0:
- IO bar, 8 bytes long, with the 16550 uart mapped to it.
- Interrupt is wired to pin A.
-
-
-multiport cards
----------------
-
-Name: pci-serial-2x
-PCI ID: 1b36:0003
-
-Name: pci-serial-4x
-PCI ID: 1b36:0004
-
-PCI Region 0:
- IO bar, with two/four 16550 uart mapped after each other.
- The first is at offset 0, second at offset 8, ...
- Interrupt is wired to pin A.
diff --git a/docs/specs/pci-testdev.rst b/docs/specs/pci-testdev.rst
new file mode 100644
index 0000000000..4b6d36543b
--- /dev/null
+++ b/docs/specs/pci-testdev.rst
@@ -0,0 +1,39 @@
+====================
+QEMU PCI test device
+====================
+
+``pci-testdev`` is a device used for testing low level IO.
+
+The device implements up to three BARs: BAR0, BAR1 and BAR2.
+Each of BAR 0+1 can be memory or IO. Guests must detect
+BAR types and act accordingly.
+
+BAR 0+1 size is up to 4K bytes each.
+BAR 0+1 starts with the following header:
+
+.. code-block:: c
+
+ typedef struct PCITestDevHdr {
+ uint8_t test; /* write-only, starts a given test number */
+ uint8_t width_type; /*
+ * read-only, type and width of access for a given test.
+ * 1,2,4 for byte,word or long write.
+ * any other value if test not supported on this BAR
+ */
+ uint8_t pad0[2];
+ uint32_t offset; /* read-only, offset in this BAR for a given test */
+ uint32_t data; /* read-only, data to use for a given test */
+ uint32_t count; /* for debugging. number of writes detected. */
+ uint8_t name[]; /* for debugging. 0-terminated ASCII string. */
+ } PCITestDevHdr;
+
+All registers are little endian.
+
+The device is expected to always implement tests 0 to N on each BAR, and to add new
+tests with higher numbers. In this way a guest can scan test numbers until it
+detects an access type that it does not support on this BAR, then stop.
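
A guest-side test harness could pin down the header layout and the stop
condition of the scan with a sketch like this one. The struct mirrors the
header shown above; the static assertions and the helper
``pci_testdev_width_supported`` are additions for illustration and assume a
C11 compiler with natural alignment for ``uint32_t``.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct PCITestDevHdr {
    uint8_t  test;       /* write-only, starts a given test number */
    uint8_t  width_type; /* read-only, 1, 2 or 4 if the test is supported */
    uint8_t  pad0[2];
    uint32_t offset;     /* read-only, offset in this BAR for the test */
    uint32_t data;       /* read-only, data to use for the test */
    uint32_t count;      /* for debugging, number of writes detected */
    uint8_t  name[];     /* for debugging, 0-terminated ASCII string */
} PCITestDevHdr;

/* Fail the build if the compiler lays the header out differently. */
static_assert(offsetof(PCITestDevHdr, offset) == 4, "unexpected offset field position");
static_assert(offsetof(PCITestDevHdr, data) == 8, "unexpected data field position");
static_assert(offsetof(PCITestDevHdr, count) == 12, "unexpected count field position");
static_assert(offsetof(PCITestDevHdr, name) == 16, "unexpected name field position");

/* A scan over test numbers stops at the first unsupported width_type. */
static bool pci_testdev_width_supported(uint8_t width_type)
{
    return width_type == 1 || width_type == 2 || width_type == 4;
}
```

Multi-byte fields read through this struct are little endian, so a
big-endian guest still needs to byte-swap them.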
+
+BAR2 is a 64-bit memory BAR, without backing storage. It is disabled
+by default and can be enabled using the ``membar=<size>`` property. This
+can be used to test whether guests handle PCI BARs of a specific
+(possibly quite large) size correctly.
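
Note on docs/specs/pci-testdev.rst: the header layout and the "scan test numbers until unsupported" rule described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not QEMU or guest driver code. The helper names pci_testdev_access_width and pci_testdev_count_tests are hypothetical, and the caller is assumed to have already read the width_type register once per test number on a given BAR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header layout as given in the document; all registers are little endian. */
typedef struct PCITestDevHdr {
    uint8_t  test;        /* write-only, starts a given test number */
    uint8_t  width_type;  /* read-only, 1/2/4 = byte/word/long write */
    uint8_t  pad0[2];
    uint32_t offset;      /* read-only, offset in this BAR for the test */
    uint32_t data;        /* read-only, data to use for the test */
    uint32_t count;       /* for debugging, number of writes detected */
    uint8_t  name[];      /* for debugging, 0-terminated ASCII string */
} PCITestDevHdr;

/* Register offsets implied by the layout above. */
static_assert(offsetof(PCITestDevHdr, width_type) == 1, "width_type at 1");
static_assert(offsetof(PCITestDevHdr, offset) == 4, "offset at 4");
static_assert(offsetof(PCITestDevHdr, data) == 8, "data at 8");
static_assert(offsetof(PCITestDevHdr, count) == 12, "count at 12");
static_assert(offsetof(PCITestDevHdr, name) == 16, "name at 16");

/*
 * Hypothetical helper: map a width_type value to an access width in bytes.
 * Returns 0 for any value other than 1, 2 or 4, which the document defines
 * as "test not supported on this BAR".
 */
static unsigned pci_testdev_access_width(uint8_t width_type)
{
    switch (width_type) {
    case 1:
    case 2:
    case 4:
        return width_type;
    default:
        return 0;
    }
}

/*
 * Hypothetical helper: given the width_type values read for test numbers
 * 0..n-1, count how many tests a guest would run before it stops at the
 * first unsupported one.
 */
static unsigned pci_testdev_count_tests(const uint8_t *width_types, unsigned n)
{
    unsigned i;

    for (i = 0; i < n; i++) {
        if (pci_testdev_access_width(width_types[i]) == 0) {
            break;
        }
    }
    return i;
}
```

The scan stops at the first unsupported entry even if later entries look valid, which matches the expectation that tests 0 to N are always implemented contiguously.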
diff --git a/docs/specs/pci-testdev.txt b/docs/specs/pci-testdev.txt
deleted file mode 100644
index 4280a1e73c..0000000000
--- a/docs/specs/pci-testdev.txt
+++ /dev/null
@@ -1,31 +0,0 @@
-pci-test is a device used for testing low level IO
-
-device implements up to three BARs: BAR0, BAR1 and BAR2.
-Each of BAR 0+1 can be memory or IO. Guests must detect
-BAR types and act accordingly.
-
-BAR 0+1 size is up to 4K bytes each.
-BAR 0+1 starts with the following header:
-
-typedef struct PCITestDevHdr {
- uint8_t test; <- write-only, starts a given test number
- uint8_t width_type; <- read-only, type and width of access for a given test.
- 1,2,4 for byte,word or long write.
- any other value if test not supported on this BAR
- uint8_t pad0[2];
- uint32_t offset; <- read-only, offset in this BAR for a given test
- uint32_t data; <- read-only, data to use for a given test
- uint32_t count; <- for debugging. number of writes detected.
- uint8_t name[]; <- for debugging. 0-terminated ASCII string.
-} PCITestDevHdr;
-
-All registers are little endian.
-
-device is expected to always implement tests 0 to N on each BAR, and to add new
-tests with higher numbers. In this way a guest can scan test numbers until it
-detects an access type that it does not support on this BAR, then stop.
-
-BAR2 is a 64bit memory bar, without backing storage. It is disabled
-by default and can be enabled using the membar=<size> property. This
-can be used to test whether guests handle pci bars of a specific
-(possibly quite large) size correctly.
diff --git a/docs/specs/ppc-spapr-hcalls.rst b/docs/specs/ppc-spapr-hcalls.rst
new file mode 100644
index 0000000000..6cdcef2026
--- /dev/null
+++ b/docs/specs/ppc-spapr-hcalls.rst
@@ -0,0 +1,99 @@
+======================
+sPAPR hypervisor calls
+======================
+
+When used with the ``pseries`` machine type, ``qemu-system-ppc64`` implements
+a set of hypervisor calls (a.k.a. hcalls) defined in the Linux on Power
+Architecture Reference ([LoPAR]_) document. This document is a subset of the
+Power Architecture Platform Reference (PAPR+) specification (IBM internal only),
+which is what PowerVM, the IBM proprietary hypervisor, adheres to.
+
+The subset in LoPAR is selected based on the requirements of Linux as a guest.
+
+In addition to those calls, we have added our own private hypervisor
+calls which are mostly used as a private interface between the firmware
+running in the guest and QEMU.
+
+All those hypercalls start at hcall number 0xf000, which corresponds
+to an implementation-specific range in PAPR.
+
+``H_RTAS (0xf000)``
+===================
+
+RTAS stands for Run-Time Abstraction Services and is a set of runtime services
+generally provided by the firmware inside the guest to the operating system. It
+predates the existence of hypervisors (it was originally an extension to Open
+Firmware) and is still used by PAPR and LoPAR to provide various services that
+are not performance sensitive.
+
+We currently implement the RTAS services in QEMU itself. The actual RTAS
+"firmware" blob in the guest is a small stub of a few instructions which
+calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU.
+
+Arguments:
+
+ ``r3``: ``H_RTAS (0xf000)``
+
+ ``r4``: Guest physical address of RTAS parameter block.
+
+Returns:
+
+ ``H_SUCCESS``: Successfully called the RTAS function (RTAS result will have
+ been stored in the parameter block).
+
+ ``H_PARAMETER``: Unknown token.
+
+``H_LOGICAL_MEMOP (0xf001)``
+============================
+
+When the guest runs in "real mode" (in PowerPC terminology this means with the
+MMU disabled, i.e. the guest effective address equals the guest physical
+address), it only has access to a subset of memory and no I/O.
+
+PAPR and LoPAR provide a set of hypervisor calls to perform cacheable or
+non-cacheable accesses to any guest physical addresses that the
+guest can use in order to access IO devices while in real mode.
+
+This is typically used by the firmware running in the guest.
+
+However, doing a hypercall for each access is extremely inefficient
+(even more so when running KVM) when accessing the frame buffer. In
+that case, things like scrolling become unusably slow.
+
+This hypercall allows the guest to request a "memory op" to be applied
+to memory. The supported memory ops at this point are to copy a range
+of memory (supports overlap of source and destination) and XOR which
+is used by our SLOF firmware to invert the screen.
+
+Arguments:
+
+ ``r3``: ``H_LOGICAL_MEMOP (0xf001)``
+
+ ``r4``: Guest physical address of destination.
+
+ ``r5``: Guest physical address of source.
+
+ ``r6``: Individual element size, defined by the binary logarithm of the
+ desired size. Supported values are:
+
+ ``0`` = 1 byte
+
+ ``1`` = 2 bytes
+
+ ``2`` = 4 bytes
+
+ ``3`` = 8 bytes
+
+ ``r7``: Number of elements.
+
+ ``r8``: Operation. Supported values are:
+
+ ``0``: copy
+
+ ``1``: xor
+
+Returns:
+
+ ``H_SUCCESS``: Success.
+
+ ``H_PARAMETER``: Invalid argument.
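
Note on docs/specs/ppc-spapr-hcalls.rst: the H_LOGICAL_MEMOP argument encoding above (r6 as the binary logarithm of the element size, r7 as the element count, r8 as the operation) can be illustrated with a small C sketch. It only models argument decoding as the document describes it, and is not the QEMU implementation. The names MEMOP_COPY, MEMOP_XOR, memop_elem_bytes and memop_total_bytes are hypothetical, and the overflow check is an added assumption rather than documented behaviour.

```c
#include <stdint.h>

/* Operation codes for r8 as listed in the document (names are hypothetical). */
enum {
    MEMOP_COPY = 0,
    MEMOP_XOR  = 1
};

/*
 * Decode r6: the element size is 2^r6 bytes for r6 in 0..3.
 * Returns 0 for any other value, which a caller would treat as an
 * invalid argument (H_PARAMETER in the document).
 */
static uint64_t memop_elem_bytes(uint64_t r6)
{
    return r6 <= 3 ? (UINT64_C(1) << r6) : 0;
}

/*
 * Validate r6/r7/r8 and compute the number of bytes the request covers.
 * Returns 1 and stores the byte count in *total on success, 0 otherwise.
 */
static int memop_total_bytes(uint64_t r6, uint64_t r7, uint64_t r8,
                             uint64_t *total)
{
    uint64_t esize = memop_elem_bytes(r6);

    if (esize == 0 || r8 > MEMOP_XOR) {
        return 0;
    }
    if (r7 > UINT64_MAX / esize) {
        return 0; /* element count would overflow the byte count */
    }
    *total = r7 * esize;
    return 1;
}
```

For example, r6 = 2 and r7 = 10 describe ten 4-byte elements, that is 40 bytes.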
diff --git a/docs/specs/ppc-spapr-hcalls.txt b/docs/specs/ppc-spapr-hcalls.txt
deleted file mode 100644
index 93fe3da91b..0000000000
--- a/docs/specs/ppc-spapr-hcalls.txt
+++ /dev/null
@@ -1,78 +0,0 @@
-When used with the "pseries" machine type, QEMU-system-ppc64 implements
-a set of hypervisor calls using a subset of the server "PAPR" specification
-(IBM internal at this point), which is also what IBM's proprietary hypervisor
-adheres too.
-
-The subset is selected based on the requirements of Linux as a guest.
-
-In addition to those calls, we have added our own private hypervisor
-calls which are mostly used as a private interface between the firmware
-running in the guest and QEMU.
-
-All those hypercalls start at hcall number 0xf000 which correspond
-to an implementation specific range in PAPR.
-
-- H_RTAS (0xf000)
-
-RTAS is a set of runtime services generally provided by the firmware
-inside the guest to the operating system. It predates the existence
-of hypervisors (it was originally an extension to Open Firmware) and
-is still used by PAPR to provide various services that aren't performance
-sensitive.
-
-We currently implement the RTAS services in QEMU itself. The actual RTAS
-"firmware" blob in the guest is a small stub of a few instructions which
-calls our private H_RTAS hypervisor call to pass the RTAS calls to QEMU.
-
-Arguments:
-
- r3 : H_RTAS (0xf000)
- r4 : Guest physical address of RTAS parameter block
-
-Returns:
-
- H_SUCCESS : Successfully called the RTAS function (RTAS result
- will have been stored in the parameter block)
- H_PARAMETER : Unknown token
-
-- H_LOGICAL_MEMOP (0xf001)
-
-When the guest runs in "real mode" (in powerpc lingua this means
-with MMU disabled, ie guest effective == guest physical), it only
-has access to a subset of memory and no IOs.
-
-PAPR provides a set of hypervisor calls to perform cacheable or
-non-cacheable accesses to any guest physical addresses that the
-guest can use in order to access IO devices while in real mode.
-
-This is typically used by the firmware running in the guest.
-
-However, doing a hypercall for each access is extremely inefficient
-(even more so when running KVM) when accessing the frame buffer. In
-that case, things like scrolling become unusably slow.
-
-This hypercall allows the guest to request a "memory op" to be applied
-to memory. The supported memory ops at this point are to copy a range
-of memory (supports overlap of source and destination) and XOR which
-is used by our SLOF firmware to invert the screen.
-
-Arguments:
-
- r3: H_LOGICAL_MEMOP (0xf001)
- r4: Guest physical address of destination
- r5: Guest physical address of source
- r6: Individual element size
- 0 = 1 byte
- 1 = 2 bytes
- 2 = 4 bytes
- 3 = 8 bytes
- r7: Number of elements
- r8: Operation
- 0 = copy
- 1 = xor
-
-Returns:
-
- H_SUCCESS : Success
- H_PARAMETER : Invalid argument
-
diff --git a/docs/specs/ppc-spapr-hotplug.rst b/docs/specs/ppc-spapr-hotplug.rst
new file mode 100644
index 0000000000..f84dc55ad9
--- /dev/null
+++ b/docs/specs/ppc-spapr-hotplug.rst
@@ -0,0 +1,510 @@
+=============================
+sPAPR Dynamic Reconfiguration
+=============================
+
+sPAPR or pSeries guests make use of a facility called dynamic reconfiguration
+to handle hot plugging of dynamic "physical" resources like PCI cards, or
+"logical"/para-virtual resources like memory, CPUs, and "physical"
+host-bridges, which are generally managed by the host/hypervisor and provided
+to guests as virtualized resources. The specifics of dynamic reconfiguration
+are documented extensively in section 13 of the Linux on Power Architecture
+Reference document ([LoPAR]_). This document provides a summary of that
+information as it applies to the implementation within QEMU.
+
+Dynamic-reconfiguration Connectors
+==================================
+
+To manage hot plug/unplug of these resources, a firmware abstraction known as
+a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
+resource to the guest, and provide an interface for the guest to manage
+configuration/removal of the resource associated with it.
+
+Device tree description of DRCs
+===============================
+
+A set of four Open Firmware device tree array properties are used to describe
+the name/index/power-domain/type of each DRC allocated to a guest at
+boot time. There may be multiple sets of these arrays, rooted at different
+paths in the device tree depending on the type of resource the DRCs manage.
+
+In some cases, the DRCs themselves may be provided by a dynamic resource,
+such as the DRCs managing PCI slots on a hot plugged PHB. In this case the
+arrays would be fetched as part of the device tree retrieval interfaces
+for hot plugged resources described under :ref:`guest-host-interface`.
+
+The array properties are described below. Each entry/element in an array
+describes the DRC identified by the element in the corresponding position
+of ``ibm,drc-indexes``:
+
+``ibm,drc-names``
+-----------------
+
+ First 4-bytes: big-endian (BE) encoded integer denoting the number of entries.
+
+ Each entry: a NULL-terminated ``<name>`` string encoded as a byte array.
+
+ ``<name>`` values for logical/virtual resources are defined in the Linux on
+ Power Architecture Reference ([LoPAR]_) section 13.5.2.4, and basically
+ consist of the type of the resource followed by a space and a numerical
+ value that's unique across resources of that type.
+
+ ``<name>`` values for "physical" resources such as PCI or VIO devices are
+ defined as being "location codes", which are the "location labels" of each
+ encapsulating device, starting from the chassis down to the individual slot
+ for the device, concatenated by a hyphen. This provides a mapping of
+ resources to a physical location in a chassis for debugging purposes. For
+ QEMU, this mapping is less important, so we assign a location code that
+ conforms to naming specifications, but is simply a location label for the
+ slot by itself to simplify the implementation. The naming convention for
+ location labels is documented in detail in the [LoPAR]_ section 12.3.1.5,
+ and in our case amounts to using ``C<n>`` for PCI/VIO device slots, where
+ ``<n>`` is unique across all PCI/VIO device slots.
+
+``ibm,drc-indexes``
+-------------------
+
+ First 4-bytes: BE-encoded integer denoting the number of entries.
+
+ Each 4-byte entry: BE-encoded ``<index>`` integer that is unique across all
+ DRCs in the machine.
+
+ ``<index>`` is arbitrary, but in the case of QEMU we try to maintain the
+ convention used to assign them to pSeries guests on pHyp (the hypervisor
+ portion of PowerVM):
+
+ ``bit[31:28]``: integer encoding of ``<type>``, where ``<type>`` is:
+
+ ``1`` for CPU resource.
+
+ ``2`` for PHB resource.
+
+ ``3`` for VIO resource.
+
+ ``4`` for PCI resource.
+
+ ``8`` for memory resource.
+
+ ``bit[27:0]``: integer encoding of ``<id>``, where ``<id>`` is unique
+ across all resources of specified type.
+
+``ibm,drc-power-domains``
+-------------------------
+
+ First 4-bytes: BE-encoded integer denoting the number of entries.
+
+ Each 4-byte entry: 32-bit, BE-encoded ``<index>`` integer that specifies the
+ power domain the resource will be assigned to. In the case of QEMU we
+ associate all resources with a "live insertion" domain, where the power is
+ assumed to be managed automatically. The integer value for this domain is a
+ special value of ``-1``.
+
+
+``ibm,drc-types``
+-----------------
+
+ First 4-bytes: BE-encoded integer denoting the number of entries.
+
+ Each entry: a NULL-terminated ``<type>`` string encoded as a byte array.
+ ``<type>`` is assigned as follows:
+
+ "CPU" for a CPU.
+
+ "PHB" for a physical host-bridge.
+
+ "SLOT" for a VIO slot.
+
+ "28" for a PCI slot.
+
+ "MEM" for memory resource.
+
+.. _guest-host-interface:
+
+Guest->Host interface to manage dynamic resources
+=================================================
+
+Each DRC is given a globally unique DRC index, and resources associated with a
+particular DRC are configured/managed by the guest via a number of RTAS calls
+which reference individual DRCs based on the DRC index. This can be considered
+the guest->host interface.
+
+``rtas-set-power-level``
+------------------------
+
+Set the power level for a specified power domain.
+
+ ``arg[0]``: integer identifying power domain.
+
+ ``arg[1]``: new power level for the domain, ``0-100``.
+
+ ``output[0]``: status, ``0`` on success.
+
+ ``output[1]``: power level after command.
+
+``rtas-get-power-level``
+------------------------
+
+Get the power level for a specified power domain.
+
+ ``arg[0]``: integer identifying power domain.
+
+ ``output[0]``: status, ``0`` on success.
+
+ ``output[1]``: current power level.
+
+``rtas-set-indicator``
+----------------------
+
+Set the state of an indicator or sensor.
+
+ ``arg[0]``: integer identifying sensor/indicator type.
+
+ ``arg[1]``: index of sensor, for DR-related sensors this is generally the DRC
+ index.
+
+ ``arg[2]``: desired sensor value.
+
+ ``output[0]``: status, ``0`` on success.
+
+For the purpose of this document we focus on the indicator/sensor types
+associated with a DRC. The types are:
+
+* ``9001``: ``isolation-state``, controls/indicates whether a device has been
+ made accessible to a guest. Supported sensor values:
+
+ ``0``: ``isolate``, device is made inaccessible by guest OS.
+
+ ``1``: ``unisolate``, device is made available to guest OS.
+
+* ``9002``: ``dr-indicator``, controls "visual" indicator associated with
+ device. Supported sensor values:
+
+ ``0``: ``inactive``, resource may be safely removed.
+
+ ``1``: ``active``, resource is in use and cannot be safely removed.
+
+ ``2``: ``identify``, used to visually identify slot for interactive hot plug.
+
+ ``3``: ``action``, in most cases, used in the same manner as identify.
+
+* ``9003``: ``allocation-state``, generally only used for "logical" DR resources
+ to request the allocation/deallocation of a resource prior to acquiring it via
+ ``isolation-state->unisolate``, or after releasing it via
+ ``isolation-state->isolate``, respectively. For "physical" DR (like PCI
+ hot plug/unplug) the pre-allocation of the resource is implied and this sensor
+ is unused. Supported sensor values:
+
+ ``0``: ``unusable``, tell firmware/system the resource can be
+ unallocated/reclaimed and added back to the system resource pool.
+
+ ``1``: ``usable``, request the resource be allocated/reserved for use by
+ guest OS.
+
+ ``2``: ``exchange``, used to allocate a spare resource to use for fail-over
+ in certain situations. Unused in QEMU.
+
+ ``3``: ``recover``, used to reclaim a previously allocated resource that's
+ not currently allocated to the guest OS. Unused in QEMU.
+
+``rtas-get-sensor-state``
+--------------------------
+
+Used to read an indicator or sensor value.
+
+ ``arg[0]``: integer identifying sensor/indicator type.
+
+ ``arg[1]``: index of sensor, for DR-related sensors this is generally the DRC
+ index.
+
+ ``output[0]``: status, ``0`` on success.
+
+For DR-related operations, the only noteworthy sensor is ``dr-entity-sense``,
+which has a type value of ``9003``, as ``allocation-state`` does in the case of
+``rtas-set-indicator``. The semantics/encodings of the sensor values are
+distinct however.
+
+Supported sensor values for ``dr-entity-sense`` (``9003``) sensor:
+
+ ``0``: empty.
+
+ For physical resources: DRC/slot is empty.
+
+ For logical resources: unused.
+
+ ``1``: present.
+
+ For physical resources: DRC/slot is populated with a device/resource.
+
+ For logical resources: resource has been allocated to the DRC.
+
+ ``2``: unusable.
+
+ For physical resources: unused.
+
+ For logical resources: DRC has no resource allocated to it.
+
+ ``3``: exchange.
+
+ For physical resources: unused.
+
+ For logical resources: resource available for exchange (see
+ ``allocation-state`` sensor semantics above).
+
+ ``4``: recovery.
+
+ For physical resources: unused.
+
+ For logical resources: resource available for recovery (see
+ ``allocation-state`` sensor semantics above).
+
+``rtas-ibm-configure-connector``
+--------------------------------
+
+Used to fetch an Open Firmware device tree description of the resource
+associated with a particular DRC.
+
+ ``arg[0]``: guest physical address of 4096-byte work area buffer.
+
+ ``arg[1]``: 0, or address of additional 4096-byte work area buffer; only
+ non-zero if a prior RTAS response indicated a need for additional memory.
+
+ ``output[0]``: status:
+
+ ``0``: completed transmittal of device tree node.
+
+ ``1``: instruct guest to prepare for next device tree sibling node.
+
+ ``2``: instruct guest to prepare for next device tree child node.
+
+ ``3``: instruct guest to prepare for next device tree property.
+
+ ``4``: instruct guest to ascend to parent device tree node.
+
+ ``5``: instruct guest to provide additional work-area buffer via ``arg[1]``.
+
+ ``990x``: instruct guest that operation took too long and to try again
+ later.
+
+The DRC index is encoded in the first 4-bytes of the first work area buffer.
+Work area (``wa``) layout, using 4-byte offsets:
+
+ ``wa[0]``: DRC index of the DRC to fetch device tree nodes from.
+
+ ``wa[1]``: ``0`` (hard-coded).
+
+ ``wa[2]``:
+
+ For next-sibling/next-child response:
+
+ ``wa`` offset of null-terminated string denoting the new node's name.
+
+ For next-property response:
+
+ ``wa`` offset of null-terminated string denoting new property's name.
+
+ ``wa[3]``: for next-property response (unused otherwise):
+
+ Byte-length of new property's value.
+
+ ``wa[4]``: for next-property response (unused otherwise):
+
+ New property's value, encoded as an OFDT-compatible byte array.
+
+Hot plug/unplug events
+======================
+
+For most DR operations, the hypervisor will issue host->guest add/remove events
+using the EPOW/check-exception notification framework, where the host issues a
+check-exception interrupt, then provides an RTAS event log via an
+rtas-check-exception call issued by the guest in response. This framework is
+documented by PAPR+ v2.7, and is already used by QEMU for generating powerdown
+requests via EPOW events.
+
+For DR, this framework has been extended to include hotplug events, which were
+previously unneeded due to direct manipulation of DR-related guest userspace
+tools by host-level management such as an HMC. This level of management is not
+applicable to KVM on Power, hence the reason for extending the notification
+framework to support hotplug events.
+
+The format for these EPOW-signalled events is described below under
+:ref:`hot-plug-unplug-event-structure`. Note that these events are not formally
+part of the PAPR+ specification, and have been superseded by a newer format,
+also described below under :ref:`hot-plug-unplug-event-structure`, and so are
+now deemed a "legacy" format. The formats are similar, but the "modern" format
+contains additional fields/flags, which are denoted for the purposes of this
+documentation with ``#ifdef GUEST_SUPPORTS_MODERN`` guards.
+
+QEMU should assume support only for "legacy" fields/flags unless the guest
+advertises support for the "modern" format via
+``ibm,client-architecture-support`` hcall by setting byte 5, bit 6 of its
+``ibm,architecture-vec-5`` option vector structure (as described by [LoPAR]_,
+section B.5.2.3). As with "legacy" format events, "modern" format events are
+surfaced to the guest via check-exception RTAS calls, but use a dedicated event
+source to signal the guest. This event source is advertised to the guest by the
+addition of a ``hot-plug-events`` node under the ``/event-sources`` node of the
+guest's device tree using the standard format described in [LoPAR]_,
+section B.5.12.2.
+
+.. _hot-plug-unplug-event-structure:
+
+Hot plug/unplug event structure
+===============================
+
+The hot plug specific payload in QEMU is implemented as follows (with all values
+encoded in big-endian format):
+
+.. code-block:: c
+
+ struct rtas_event_log_v6_hp {
+ #define SECTION_ID_HOTPLUG 0x4850 /* HP */
+ struct section_header {
+ uint16_t section_id; /* set to SECTION_ID_HOTPLUG */
+ uint16_t section_length; /* sizeof(rtas_event_log_v6_hp),
+ * plus the length of the DRC name
+ * if a DRC name identifier is
+ * specified for hotplug_identifier
+ */
+ uint8_t section_version; /* version 1 */
+ uint8_t section_subtype; /* unused */
+ uint16_t creator_component_id; /* unused */
+ } hdr;
+ #define RTAS_LOG_V6_HP_TYPE_CPU 1
+ #define RTAS_LOG_V6_HP_TYPE_MEMORY 2
+ #define RTAS_LOG_V6_HP_TYPE_SLOT 3
+ #define RTAS_LOG_V6_HP_TYPE_PHB 4
+ #define RTAS_LOG_V6_HP_TYPE_PCI 5
+ uint8_t hotplug_type; /* type of resource/device */
+ #define RTAS_LOG_V6_HP_ACTION_ADD 1
+ #define RTAS_LOG_V6_HP_ACTION_REMOVE 2
+ uint8_t hotplug_action; /* action (add/remove) */
+ #define RTAS_LOG_V6_HP_ID_DRC_NAME 1
+ #define RTAS_LOG_V6_HP_ID_DRC_INDEX 2
+ #define RTAS_LOG_V6_HP_ID_DRC_COUNT 3
+ #ifdef GUEST_SUPPORTS_MODERN
+ #define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
+ #endif
+ uint8_t hotplug_identifier; /* type of the resource identifier,
+ * which serves as the discriminator
+ * for the 'drc' union field below
+ */
+ #ifdef GUEST_SUPPORTS_MODERN
+ uint8_t capabilities; /* capability flags, currently unused
+ * by QEMU
+ */
+ #else
+ uint8_t reserved;
+ #endif
+ union {
+ uint32_t index; /* DRC index of resource to take action
+ * on
+ */
+ uint32_t count; /* number of DR resources to take
+ * action on (guest chooses which)
+ */
+ #ifdef GUEST_SUPPORTS_MODERN
+ struct {
+ uint32_t count; /* number of DR resources to take
+ * action on
+ */
+ uint32_t index; /* DRC index of first resource to take
+ * action on. guest will take action
+ * on DRC index <index> through
+ * DRC index <index + count - 1> in
+ * sequential order
+ */
+ } count_indexed;
+ #endif
+ char name[1]; /* string representing the name of the
+ * DRC to take action on
+ */
+ } drc;
+ } QEMU_PACKED;
+
+``ibm,lrdr-capacity``
+=====================
+
+``ibm,lrdr-capacity`` is a property in the ``/rtas`` device tree node that
+identifies the dynamic reconfiguration capabilities of the guest. It is a
+triple consisting of ``<phys>``, ``<size>`` and ``<maxcpus>``.
+
+ ``<phys>``, encoded in BE format represents the maximum address in bytes and
+ hence the maximum memory that can be allocated to the guest.
+
+ ``<size>``, encoded in BE format represents the size increments in which
+ memory can be hot-plugged to the guest.
+
+ ``<maxcpus>``, a BE-encoded integer, represents the maximum number of
+ processors that the guest can have.
+
+``pseries`` guests use this property to note the maximum allowed CPUs for the
+guest.
+
+``ibm,dynamic-reconfiguration-memory``
+======================================
+
+``ibm,dynamic-reconfiguration-memory`` is a device tree node that represents
+dynamically reconfigurable logical memory blocks (LMB). This node is generated
+only when the guest advertises support for it via the
+``ibm,client-architecture-support`` call. Memory that is not dynamically
+reconfigurable is represented by ``/memory`` nodes. The properties of this node
+that are of interest to the sPAPR memory hotplug implementation in QEMU are
+described here.
+
+``ibm,lmb-size``
+----------------
+
+This 64-bit integer defines the size of each dynamically reconfigurable LMB.
+
+``ibm,associativity-lookup-arrays``
+-----------------------------------
+
+This property defines a lookup array in which the NUMA associativity
+information for each LMB can be found. It is a property-encoded array
+that begins with an integer M (the number of associativity lists), followed
+by an integer N (the number of entries per associativity list), and ends
+with M associativity lists, each N integers long.
+
+This property provides the same information as given by ``ibm,associativity``
+property in a ``/memory`` node. Each assigned LMB has an index value between
+0 and M-1 which is used as an index into this table to select which
+associativity list to use for the LMB. This index value for each LMB is defined
+in ``ibm,dynamic-memory`` property.
+
+``ibm,dynamic-memory``
+----------------------
+
+This property describes the dynamically reconfigurable memory. It is a
+property encoded array that has an integer N, the number of LMBs followed
+by N LMB list entries.
+
+Each LMB list entry consists of the following elements:
+
+- Logical address of the start of the LMB encoded as a 64-bit integer. This
+ corresponds to ``reg`` property in ``/memory`` node.
+- DRC index of the LMB that corresponds to ``ibm,my-drc-index`` property
+ in a ``/memory`` node.
+- Four bytes reserved for expansion.
+- Associativity list index for the LMB that is used as an index into
+ ``ibm,associativity-lookup-arrays`` property described earlier. This is used
+ to retrieve the right associativity list to be used for this LMB.
+- A 32-bit flags word. The bit at bit position ``0x00000008`` defines whether
+ the LMB is assigned to the partition as of boot time.
+
+``ibm,dynamic-memory-v2``
+-------------------------
+
+This property describes the dynamically reconfigurable memory. This is
+an alternate and newer way to describe dynamically reconfigurable memory.
+It is a property encoded array that has an integer N (the number of
+LMB set entries) followed by N LMB set entries. There is an LMB set entry
+for each sequential group of LMBs that share common attributes.
+
+Each LMB set entry consists of the following elements:
+
+- Number of sequential LMBs in the entry represented by a 32-bit integer.
+- Logical address of the first LMB in the set encoded as a 64-bit integer.
+- DRC index of the first LMB in the set.
+- Associativity list index that is used as an index into
+ ``ibm,associativity-lookup-arrays`` property described earlier. This
+ is used to retrieve the right associativity list to be used for all
+ the LMBs in this set.
+- A 32-bit flags word that applies to all the LMBs in the set.
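
Note on docs/specs/ppc-spapr-hotplug.rst: the DRC index convention described under ibm,drc-indexes (type in bit[31:28], id in bit[27:0]) and the big-endian layout of that property (a 4-byte entry count followed by 4-byte entries) can be sketched in C as below. This is an illustration under stated assumptions, not code taken from QEMU. All identifiers (DRC_TYPE_ID_*, drc_index_make, drc_index_type, drc_index_id, be32_load, drc_indexes_get) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* <type> values for bit[31:28] as listed in the document. */
enum {
    DRC_TYPE_ID_CPU = 1,
    DRC_TYPE_ID_PHB = 2,
    DRC_TYPE_ID_VIO = 3,
    DRC_TYPE_ID_PCI = 4,
    DRC_TYPE_ID_MEM = 8
};

/* Compose a DRC index from a 4-bit type and a 28-bit id. */
static uint32_t drc_index_make(uint32_t type, uint32_t id)
{
    return ((type & 0xfu) << 28) | (id & 0x0fffffffu);
}

/* Extract bit[31:28], the resource type. */
static uint32_t drc_index_type(uint32_t index)
{
    return index >> 28;
}

/* Extract bit[27:0], the per-type id. */
static uint32_t drc_index_id(uint32_t index)
{
    return index & 0x0fffffffu;
}

/* Load a big-endian 32-bit integer from a byte buffer. */
static uint32_t be32_load(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Read entry i of an ibm,drc-indexes property blob of len bytes.
 * Returns 0 and stores the index in *out on success, -1 if the blob is
 * too short or i is out of range.
 */
static int drc_indexes_get(const uint8_t *prop, size_t len, uint32_t i,
                           uint32_t *out)
{
    uint32_t n;

    if (len < 4) {
        return -1;
    }
    n = be32_load(prop);
    if (i >= n || i >= (len - 4) / 4) {
        return -1;
    }
    *out = be32_load(prop + 4 + 4 * (size_t)i);
    return 0;
}
```

Under this convention a memory DRC with id 5 has index 0x80000005, and a PCI DRC with id 3 has index 0x40000003.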
diff --git a/docs/specs/ppc-spapr-hotplug.txt b/docs/specs/ppc-spapr-hotplug.txt
deleted file mode 100644
index d4fb2d46d9..0000000000
--- a/docs/specs/ppc-spapr-hotplug.txt
+++ /dev/null
@@ -1,409 +0,0 @@
-= sPAPR Dynamic Reconfiguration =
-
-sPAPR/"pseries" guests make use of a facility called dynamic-reconfiguration
-to handle hotplugging of dynamic "physical" resources like PCI cards, or
-"logical"/paravirtual resources like memory, CPUs, and "physical"
-host-bridges, which are generally managed by the host/hypervisor and provided
-to guests as virtualized resources. The specifics of dynamic-reconfiguration
-are documented extensively in PAPR+ v2.7, Section 13.1. This document
-provides a summary of that information as it applies to the implementation
-within QEMU.
-
-== Dynamic-reconfiguration Connectors ==
-
-To manage hotplug/unplug of these resources, a firmware abstraction known as
-a Dynamic Resource Connector (DRC) is used to assign a particular dynamic
-resource to the guest, and provide an interface for the guest to manage
-configuration/removal of the resource associated with it.
-
-== Device-tree description of DRCs ==
-
-A set of 4 Open Firmware device tree array properties are used to describe
-the name/index/power-domain/type of each DRC allocated to a guest at
-boot-time. There may be multiple sets of these arrays, rooted at different
-paths in the device tree depending on the type of resource the DRCs manage.
-
-In some cases, the DRCs themselves may be provided by a dynamic resource,
-such as the DRCs managing PCI slots on a hotplugged PHB. In this case the
-arrays would be fetched as part of the device tree retrieval interfaces
-for hotplugged resources described under "Guest->Host interface".
-
-The array properties are described below. Each entry/element in an array
-describes the DRC identified by the element in the corresponding position
-of ibm,drc-indexes:
-
-ibm,drc-names:
- first 4-bytes: BE-encoded integer denoting the number of entries
- each entry: a NULL-terminated <name> string encoded as a byte array
-
- <name> values for logical/virtual resources are defined in PAPR+ v2.7,
- Section 13.5.2.4, and basically consist of the type of the resource
- followed by a space and a numerical value that's unique across resources
- of that type.
-
- <name> values for "physical" resources such as PCI or VIO devices are
- defined as being "location codes", which are the "location labels" of
- each encapsulating device, starting from the chassis down to the
- individual slot for the device, concatenated by a hyphen. This provides
- a mapping of resources to a physical location in a chassis for debugging
- purposes. For QEMU, this mapping is less important, so we assign a
- location code that conforms to naming specifications, but is simply a
- location label for the slot by itself to simplify the implementation.
- The naming convention for location labels is documented in detail in
- PAPR+ v2.7, Section 12.3.1.5, and in our case amounts to using "C<n>"
- for PCI/VIO device slots, where <n> is unique across all PCI/VIO
- device slots.
-
-ibm,drc-indexes:
- first 4-bytes: BE-encoded integer denoting the number of entries
- each 4-byte entry: BE-encoded <index> integer that is unique across all DRCs
- in the machine
-
- <index> is arbitrary, but in the case of QEMU we try to maintain the
- convention used to assign them to pSeries guests on pHyp:
-
- bit[31:28]: integer encoding of <type>, where <type> is:
- 1 for CPU resource
- 2 for PHB resource
- 3 for VIO resource
- 4 for PCI resource
- 8 for Memory resource
- bit[27:0]: integer encoding of <id>, where <id> is unique across
- all resources of specified type
-
-ibm,drc-power-domains:
- first 4-bytes: BE-encoded integer denoting the number of entries
- each 4-byte entry: 32-bit, BE-encoded <index> integer that specifies the
- power domain the resource will be assigned to. In the case of QEMU
- we associated all resources with a "live insertion" domain, where the
- power is assumed to be managed automatically. The integer value for
- this domain is a special value of -1.
-
-
-ibm,drc-types:
- first 4-bytes: BE-encoded integer denoting the number of entries
- each entry: a NULL-terminated <type> string encoded as a byte array
-
- <type> is assigned as follows:
- "CPU" for a CPU
- "PHB" for a physical host-bridge
- "SLOT" for a VIO slot
- "28" for a PCI slot
- "MEM" for memory resource
-
-== Guest->Host interface to manage dynamic resources ==
-
-Each DRC is given a globally unique DRC Index, and resources associated with
-a particular DRC are configured/managed by the guest via a number of RTAS
-calls which reference individual DRCs based on the DRC index. This can be
-considered the guest->host interface.
-
-rtas-set-power-level:
- arg[0]: integer identifying power domain
- arg[1]: new power level for the domain, 0-100
- output[0]: status, 0 on success
- output[1]: power level after command
-
- Set the power level for a specified power domain
-
-rtas-get-power-level:
- arg[0]: integer identifying power domain
- output[0]: status, 0 on success
- output[1]: current power level
-
- Get the power level for a specified power domain
-
-rtas-set-indicator:
- arg[0]: integer identifying sensor/indicator type
- arg[1]: index of sensor, for DR-related sensors this is generally the
- DRC index
- arg[2]: desired sensor value
- output[0]: status, 0 on success
-
- Set the state of an indicator or sensor. For the purpose of this document we
- focus on the indicator/sensor types associated with a DRC. The types are:
-
- 9001: isolation-state, controls/indicates whether a device has been made
- accessible to a guest
-
- supported sensor values:
- 0: isolate, device is made inaccessible to the guest OS
- 1: unisolate, device is made available to guest OS
-
- 9002: dr-indicator, controls "visual" indicator associated with device
-
- supported sensor values:
- 0: inactive, resource may be safely removed
- 1: active, resource is in use and cannot be safely removed
- 2: identify, used to visually identify slot for interactive hotplug
- 3: action, in most cases, used in the same manner as identify
-
- 9003: allocation-state, generally only used for "logical" DR resources to
- request the allocation/deallocation of a resource prior to acquiring
- it via isolation-state->unisolate, or after releasing it via
- isolation-state->isolate, respectively. for "physical" DR (like PCI
- hotplug/unplug) the pre-allocation of the resource is implied and
- this sensor is unused.
-
- supported sensor values:
- 0: unusable, tell firmware/system the resource can be
- unallocated/reclaimed and added back to the system resource pool
- 1: usable, request the resource be allocated/reserved for use by
- guest OS
- 2: exchange, used to allocate a spare resource to use for fail-over
- in certain situations. unused in QEMU
- 3: recover, used to reclaim a previously allocated resource that's
- not currently allocated to the guest OS. unused in QEMU
-
-rtas-get-sensor-state:
- arg[0]: integer identifying sensor/indicator type
- arg[1]: index of sensor, for DR-related sensors this is generally the
- DRC index
- output[0]: status, 0 on success
-
- Used to read an indicator or sensor value.
-
- For DR-related operations, the only noteworthy sensor is dr-entity-sense,
- which has a type value of 9003, as allocation-state does in the case of
- rtas-set-indicator. The semantics/encodings of the sensor values are distinct
- however:
-
- supported sensor values for dr-entity-sense (9003) sensor:
- 0: empty,
- for physical resources: DRC/slot is empty
- for logical resources: unused
- 1: present,
- for physical resources: DRC/slot is populated with a device/resource
- for logical resources: resource has been allocated to the DRC
- 2: unusable,
- for physical resources: unused
- for logical resources: DRC has no resource allocated to it
- 3: exchange,
- for physical resources: unused
- for logical resources: resource available for exchange (see
- allocation-state sensor semantics above)
- 4: recovery,
- for physical resources: unused
- for logical resources: resource available for recovery (see
- allocation-state sensor semantics above)
-
-rtas-ibm-configure-connector:
- arg[0]: guest physical address of 4096-byte work area buffer
- arg[1]: 0, or address of additional 4096-byte work area buffer. only non-zero
- if a prior RTAS response indicated a need for additional memory
- output[0]: status:
- 0: completed transmittal of device-tree node
- 1: instruct guest to prepare for next DT sibling node
- 2: instruct guest to prepare for next DT child node
- 3: instruct guest to prepare for next DT property
- 4: instruct guest to ascend to parent DT node
- 5: instruct guest to provide additional work-area buffer
- via arg[1]
- 990x: instruct guest that operation took too long and to try
- again later
-
- Used to fetch an OF device-tree description of the resource associated with
- a particular DRC. The DRC index is encoded in the first 4-bytes of the first
- work area buffer.
-
- Work area layout, using 4-byte offsets:
- wa[0]: DRC index of the DRC to fetch device-tree nodes from
- wa[1]: 0 (hard-coded)
- wa[2]: for next-sibling/next-child response:
- wa offset of null-terminated string denoting the new node's name
- for next-property response:
- wa offset of null-terminated string denoting new property's name
- wa[3]: for next-property response (unused otherwise):
- byte-length of new property's value
- wa[4]: for next-property response (unused otherwise):
- new property's value, encoded as an OFDT-compatible byte array
-
-== hotplug/unplug events ==
-
-For most DR operations, the hypervisor will issue host->guest add/remove events
-using the EPOW/check-exception notification framework, where the host issues a
-check-exception interrupt, then provides an RTAS event log via an
-rtas-check-exception call issued by the guest in response. This framework is
-documented by PAPR+ v2.7, and is already in use by QEMU for generating powerdown
-requests via EPOW events.
-
-For DR, this framework has been extended to include hotplug events, which were
-previously unneeded due to direct manipulation of DR-related guest userspace
-tools by host-level management such as an HMC. This level of management is not
-applicable to PowerKVM, hence the reason for extending the notification
-framework to support hotplug events.
-
-The format for these EPOW-signalled events is described below under
-"hotplug/unplug event structure". Note that these events are not
-formally part of the PAPR+ specification, and have been superseded by a
-newer format, also described below under "hotplug/unplug event structure",
-and so are now deemed a "legacy" format. The formats are similar, but the
-"modern" format contains additional fields/flags, which are denoted for the
-purposes of this documentation with "#ifdef GUEST_SUPPORTS_MODERN" guards.
-
-QEMU should assume support only for "legacy" fields/flags unless the guest
-advertises support for the "modern" format via ibm,client-architecture-support
-hcall by setting byte 5, bit 6 of its ibm,architecture-vec-5 option vector
-structure (as described by LoPAPR v11, B.6.2.3). As with "legacy" format events,
-"modern" format events are surfaced to the guest via check-exception RTAS calls,
-but use a dedicated event source to signal the guest. This event source is
-advertised to the guest by the addition of a "hot-plug-events" node under
-"/event-sources" node of the guest's device tree using the standard format
-described in LoPAPR v11, B.6.12.1.
-
-== hotplug/unplug event structure ==
-
-The hotplug-specific payload in QEMU is implemented as follows (with all values
-encoded in big-endian format):
-
-struct rtas_event_log_v6_hp {
-#define SECTION_ID_HOTPLUG 0x4850 /* HP */
- struct section_header {
- uint16_t section_id; /* set to SECTION_ID_HOTPLUG */
- uint16_t section_length; /* sizeof(rtas_event_log_v6_hp),
- * plus the length of the DRC name
- * if a DRC name identifier is
- * specified for hotplug_identifier
- */
- uint8_t section_version; /* version 1 */
- uint8_t section_subtype; /* unused */
- uint16_t creator_component_id; /* unused */
- } hdr;
-#define RTAS_LOG_V6_HP_TYPE_CPU 1
-#define RTAS_LOG_V6_HP_TYPE_MEMORY 2
-#define RTAS_LOG_V6_HP_TYPE_SLOT 3
-#define RTAS_LOG_V6_HP_TYPE_PHB 4
-#define RTAS_LOG_V6_HP_TYPE_PCI 5
- uint8_t hotplug_type; /* type of resource/device */
-#define RTAS_LOG_V6_HP_ACTION_ADD 1
-#define RTAS_LOG_V6_HP_ACTION_REMOVE 2
- uint8_t hotplug_action; /* action (add/remove) */
-#define RTAS_LOG_V6_HP_ID_DRC_NAME 1
-#define RTAS_LOG_V6_HP_ID_DRC_INDEX 2
-#define RTAS_LOG_V6_HP_ID_DRC_COUNT 3
-#ifdef GUEST_SUPPORTS_MODERN
-#define RTAS_LOG_V6_HP_ID_DRC_COUNT_INDEXED 4
-#endif
- uint8_t hotplug_identifier; /* type of the resource identifier,
- * which serves as the discriminator
- * for the 'drc' union field below
- */
-#ifdef GUEST_SUPPORTS_MODERN
- uint8_t capabilities; /* capability flags, currently unused
- * by QEMU
- */
-#else
- uint8_t reserved;
-#endif
- union {
- uint32_t index; /* DRC index of resource to take action
- * on
- */
- uint32_t count; /* number of DR resources to take
- * action on (guest chooses which)
- */
-#ifdef GUEST_SUPPORTS_MODERN
- struct {
- uint32_t count; /* number of DR resources to take
- * action on
- */
- uint32_t index; /* DRC index of first resource to take
- * action on. guest will take action
- * on DRC index <index> through
- * DRC index <index + count - 1> in
- * sequential order
- */
- } count_indexed;
-#endif
- char name[1]; /* string representing the name of the
- * DRC to take action on
- */
- } drc;
-} QEMU_PACKED;
-
-== ibm,lrdr-capacity ==
-
-ibm,lrdr-capacity is a property in the /rtas device tree node that identifies
-the dynamic reconfiguration capabilities of the guest. It is a triple of
-<phys>, <size> and <maxcpus>.
-
- <phys>, encoded in BE format represents the maximum address in bytes and
- hence the maximum memory that can be allocated to the guest.
-
- <size>, encoded in BE format represents the size increments in which
- memory can be hot-plugged to the guest.
-
- <maxcpus>, a BE-encoded integer, represents the maximum number of
- processors that the guest can have.
-
-pseries guests use this property to note the maximum allowed CPUs for the
-guest.
-
-== ibm,dynamic-reconfiguration-memory ==
-
-ibm,dynamic-reconfiguration-memory is a device tree node that represents
-dynamically reconfigurable logical memory blocks (LMB). This node
-is generated only when the guest advertises the support for it via
-ibm,client-architecture-support call. Memory that is not dynamically
-reconfigurable is represented by /memory nodes. The properties of this
-node that are of interest to the sPAPR memory hotplug implementation
-in QEMU are described here.
-
-ibm,lmb-size
-
-This 64bit integer defines the size of each dynamically reconfigurable LMB.
-
-ibm,associativity-lookup-arrays
-
-This property defines a lookup array in which the NUMA associativity
-information for each LMB can be found. It is a property encoded array
-that begins with an integer M, the number of associativity lists followed
-by an integer N, the number of entries per associativity list and terminated
-by M associativity lists each of length N integers.
-
-This property provides the same information as given by ibm,associativity
-property in a /memory node. Each assigned LMB has an index value between
-0 and M-1 which is used as an index into this table to select which
-associativity list to use for the LMB. This index value for each LMB
-is defined in ibm,dynamic-memory property.
-
-ibm,dynamic-memory
-
-This property describes the dynamically reconfigurable memory. It is a
-property encoded array that has an integer N, the number of LMBs followed
-by N LMB list entries.
-
-Each LMB list entry consists of the following elements:
-
-- Logical address of the start of the LMB encoded as a 64bit integer. This
- corresponds to reg property in /memory node.
-- DRC index of the LMB that corresponds to ibm,my-drc-index property
- in a /memory node.
-- Four bytes reserved for expansion.
-- Associativity list index for the LMB that is used as an index into
- ibm,associativity-lookup-arrays property described earlier. This
- is used to retrieve the right associativity list to be used for this
- LMB.
-- A 32bit flags word. The bit at bit position 0x00000008 defines whether
- the LMB is assigned to the partition as of boot time.
-
-ibm,dynamic-memory-v2
-
-This property describes the dynamically reconfigurable memory. This is
-an alternate and newer way to describe dynamically reconfigurable memory.
-It is a property encoded array that has an integer N (the number of
-LMB set entries) followed by N LMB set entries. There is an LMB set entry
-for each sequential group of LMBs that share common attributes.
-
-Each LMB set entry consists of the following elements:
-
-- Number of sequential LMBs in the entry represented by a 32bit integer.
-- Logical address of the first LMB in the set encoded as a 64bit integer.
-- DRC index of the first LMB in the set.
-- Associativity list index that is used as an index into
- ibm,associativity-lookup-arrays property described earlier. This
- is used to retrieve the right associativity list to be used for all
- the LMBs in this set.
-- A 32bit flags word that applies to all the LMBs in the set.
-
-[1] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/75350/focus=106867
diff --git a/docs/specs/ppc-spapr-uv-hcalls.rst b/docs/specs/ppc-spapr-uv-hcalls.rst
new file mode 100644
index 0000000000..a00288deb3
--- /dev/null
+++ b/docs/specs/ppc-spapr-uv-hcalls.rst
@@ -0,0 +1,89 @@
+===================================
+Hypervisor calls and the Ultravisor
+===================================
+
+On PPC64 systems supporting Protected Execution Facility (PEF), system memory
+can be placed in a secured region that only an ultravisor running in firmware
+can provide access to. pSeries guests on such systems can communicate with
+the ultravisor (via ultracalls) to switch to a secure virtual machine (SVM) mode
+where the guest's memory is relocated to this secured region, making its memory
+inaccessible to normal processes/guests running on the host.
+
+The various ultracalls/hypercalls relating to SVM mode are currently only
+documented internally, but are planned for direct inclusion into the Linux on
+Power Architecture Reference document ([LoPAR]_). An internal ACR has been filed
+to reserve a hypercall number range specific to this use case to avoid any
+future conflicts with the IBM internally maintained Power Architecture Platform
+Reference (PAPR+) documentation specification. This document summarizes some of
+these details as they relate to QEMU.
+
+Hypercalls needed by the ultravisor
+===================================
+
+Switching to SVM mode involves a number of hcalls issued by the ultravisor to
+the hypervisor to orchestrate the movement of guest memory to secure memory and
+various other aspects of the SVM mode. Numbers are assigned for these hcalls
+within the reserved range ``0xEF00-0xEF80``. The hcalls relevant to QEMU are
+documented below.
+
+``H_TPM_COMM`` (``0xef10``)
+---------------------------
+
+SVM file systems are encrypted using a symmetric key. This key is then
+wrapped/encrypted using the public key of a trusted system which has the private
+key stored in the system's TPM. An Ultravisor will use this hcall to
+unwrap/unseal the symmetric key using the system's TPM device or a TPM Resource
+Manager associated with the device.
+
+The Ultravisor sets up a separate session key with the TPM in advance during
+host system boot. All sensitive in and out values will be encrypted using the
+session key. Though the hypervisor will see the in and out buffers in raw form,
+any sensitive contents will generally be encrypted using this session key.
+
+Arguments:
+
+ ``r3``: ``H_TPM_COMM`` (``0xef10``)
+
+ ``r4``: ``TPM`` operation, one of:
+
+ ``TPM_COMM_OP_EXECUTE`` (``0x1``): send a request to a TPM and receive a
+ response, opening a new TPM session if one has not already been opened.
+
+ ``TPM_COMM_OP_CLOSE_SESSION`` (``0x2``): close the existing TPM session, if
+ any.
+
+ ``r5``: ``in_buffer``, guest physical address of buffer containing the
+ request. Caller may use the same address for both request and response.
+
+ ``r6``: ``in_size``, size of the in buffer. Must be less than or equal to
+ 4 KB.
+
+ ``r7``: ``out_buffer``, guest physical address of buffer to store the
+ response. Caller may use the same address for both request and response.
+
+ ``r8``: ``out_size``, size of the out buffer. Must be at least 4 KB, as this
+ is the maximum request/response size supported by most TPM implementations,
+ including the TPM Resource Manager in the Linux kernel.
+
+Return values:
+
+ ``r3``: one of the following values:
+
+ ``H_Success``: request processed successfully.
+
+ ``H_PARAMETER``: invalid TPM operation.
+
+ ``H_P2``: ``in_buffer`` is invalid.
+
+ ``H_P3``: ``in_size`` is invalid.
+
+ ``H_P4``: ``out_buffer`` is invalid.
+
+ ``H_P5``: ``out_size`` is invalid.
+
+ ``H_RESOURCE``: problem communicating with TPM.
+
+ ``H_FUNCTION``: TPM access is not currently allowed/configured.
+
+ ``r4``: For ``TPM_COMM_OP_EXECUTE``, the size of the response will be stored
+ here upon success.
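+
+As an illustration of the size rules above, the following is a minimal sketch
+of how the arguments might be validated. The function name
+``check_tpm_comm_args`` and the ``SK_*`` status names are hypothetical (they
+are not the real hcall return values), buffer address validation (``H_P2``,
+``H_P4``) is not modelled, and applying the size checks only to
+``TPM_COMM_OP_EXECUTE`` is an assumption:
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    #define TPM_COMM_OP_EXECUTE        0x1
+    #define TPM_COMM_OP_CLOSE_SESSION  0x2
+    #define TPM_BUF_LIMIT              4096u  /* the 4 KB bound from the text */
+
+    /* Hypothetical status names for this sketch only. */
+    enum sketch_status { SK_SUCCESS, SK_PARAMETER, SK_P3, SK_P5 };
+
+    static enum sketch_status check_tpm_comm_args(uint64_t op,
+                                                  uint64_t in_size,
+                                                  uint64_t out_size)
+    {
+        switch (op) {
+        case TPM_COMM_OP_CLOSE_SESSION:
+            return SK_SUCCESS;          /* no buffers involved */
+        case TPM_COMM_OP_EXECUTE:
+            if (in_size > TPM_BUF_LIMIT) {
+                return SK_P3;           /* in_size must be <= 4 KB */
+            }
+            if (out_size < TPM_BUF_LIMIT) {
+                return SK_P5;           /* out_size must be >= 4 KB */
+            }
+            return SK_SUCCESS;
+        default:
+            return SK_PARAMETER;        /* unknown TPM operation */
+        }
+    }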
diff --git a/docs/specs/ppc-spapr-uv-hcalls.txt b/docs/specs/ppc-spapr-uv-hcalls.txt
deleted file mode 100644
index 389c2740d7..0000000000
--- a/docs/specs/ppc-spapr-uv-hcalls.txt
+++ /dev/null
@@ -1,76 +0,0 @@
-On PPC64 systems supporting Protected Execution Facility (PEF), system
-memory can be placed in a secured region where only an "ultravisor"
-running in firmware can provide to access it. pseries guests on such
-systems can communicate with the ultravisor (via ultracalls) to switch to a
-secure VM mode (SVM) where the guest's memory is relocated to this secured
-region, making its memory inaccessible to normal processes/guests running on
-the host.
-
-The various ultracalls/hypercalls relating to SVM mode are currently
-only documented internally, but are planned for direct inclusion into the
-public OpenPOWER version of the PAPR specification (LoPAPR/LoPAR). An internal
-ACR has been filed to reserve a hypercall number range specific to this
-use-case to avoid any future conflicts with the internally-maintained PAPR
-specification. This document summarizes some of these details as they relate
-to QEMU.
-
-== hypercalls needed by the ultravisor ==
-
-Switching to SVM mode involves a number of hcalls issued by the ultravisor
-to the hypervisor to orchestrate the movement of guest memory to secure
-memory and various other aspects SVM mode. Numbers are assigned for these
-hcalls within the reserved range 0xEF00-0xEF80. The below documents the
-hcalls relevant to QEMU.
-
-- H_TPM_COMM (0xef10)
-
- For TPM_COMM_OP_EXECUTE operation:
- Send a request to a TPM and receive a response, opening a new TPM session
- if one has not already been opened.
-
- For TPM_COMM_OP_CLOSE_SESSION operation:
- Close the existing TPM session, if any.
-
- Arguments:
-
- r3 : H_TPM_COMM (0xef10)
- r4 : TPM operation, one of:
- TPM_COMM_OP_EXECUTE (0x1)
- TPM_COMM_OP_CLOSE_SESSION (0x2)
- r5 : in_buffer, guest physical address of buffer containing the request
- - Caller may use the same address for both request and response
- r6 : in_size, size of the in buffer
- - Must be less than or equal to 4KB
- r7 : out_buffer, guest physical address of buffer to store the response
- - Caller may use the same address for both request and response
- r8 : out_size, size of the out buffer
- - Must be at least 4KB, as this is the maximum request/response size
- supported by most TPM implementations, including the TPM Resource
- Manager in the linux kernel.
-
- Return values:
-
- r3 : H_Success request processed successfully
- H_PARAMETER invalid TPM operation
- H_P2 in_buffer is invalid
- H_P3 in_size is invalid
- H_P4 out_buffer is invalid
- H_P5 out_size is invalid
- H_RESOURCE problem communicating with TPM
- H_FUNCTION TPM access is not currently allowed/configured
- r4 : For TPM_COMM_OP_EXECUTE, the size of the response will be stored here
- upon success.
-
- Use-case/notes:
-
- SVM filesystems are encrypted using a symmetric key. This key is then
- wrapped/encrypted using the public key of a trusted system which has the
- private key stored in the system's TPM. An Ultravisor will use this
- hcall to unwrap/unseal the symmetric key using the system's TPM device
- or a TPM Resource Manager associated with the device.
-
- The Ultravisor sets up a separate session key with the TPM in advance
- during host system boot. All sensitive in and out values will be
- encrypted using the session key. Though the hypervisor will see the 'in'
- and 'out' buffers in raw form, any sensitive contents will generally be
- encrypted using this session key.
diff --git a/docs/specs/pvpanic.txt b/docs/specs/pvpanic.rst
index 8afcde11cc..b0f27860ec 100644
--- a/docs/specs/pvpanic.txt
+++ b/docs/specs/pvpanic.rst
@@ -21,18 +21,23 @@ recognize. On write, the bits not recognized by the device are ignored.
Software should set only bits both itself and the device recognize.
Bit Definition
---------------
-bit 0: a guest panic has happened and should be processed by the host
-bit 1: a guest panic has happened and will be handled by the guest;
- the host should record it or report it, but should not affect
- the execution of the guest.
+~~~~~~~~~~~~~~
+
+bit 0
+ a guest panic has happened and should be processed by the host
+bit 1
+ a guest panic has happened and will be handled by the guest;
+ the host should record it or report it, but should not affect
+ the execution of the guest.
+bit 2 (to be implemented)
+ a regular guest shutdown has happened and should be processed by the host
PCI Interface
-------------
The PCI interface is similar to the ISA interface except that it uses an MMIO
address space provided by its BAR0, 1 byte long. Any machine with a PCI bus
-can enable a pvpanic device by adding '-device pvpanic-pci' to the command
+can enable a pvpanic device by adding ``-device pvpanic-pci`` to the command
line.
ACPI Interface
@@ -40,15 +45,25 @@ ACPI Interface
pvpanic device is defined with ACPI ID "QEMU0001". Custom methods:
-RDPT: To determine whether guest panic notification is supported.
-Arguments: None
-Return: Returns a byte, with the same semantics as the I/O port
- interface.
+RDPT
+~~~~
+
+To determine whether guest panic notification is supported.
+
+Arguments
+ None
+Return
+ Returns a byte, with the same semantics as the I/O port interface.
+
+WRPT
+~~~~
+
+To send a guest panic event.
-WRPT: To send a guest panic event
-Arguments: Arg0 is a byte to be written, with the same semantics as
- the I/O interface.
-Return: None
+Arguments
+ Arg0 is a byte to be written, with the same semantics as the I/O interface.
+Return
+ None
The ACPI device will automatically refer to the right port in case it
is modified.
diff --git a/docs/specs/sev-guest-firmware.rst b/docs/specs/sev-guest-firmware.rst
new file mode 100644
index 0000000000..3f7f082df5
--- /dev/null
+++ b/docs/specs/sev-guest-firmware.rst
@@ -0,0 +1,125 @@
+====================================================
+QEMU/Guest Firmware Interface for AMD SEV and SEV-ES
+====================================================
+
+Overview
+========
+
+The guest firmware image (OVMF) may contain some configuration entries
+which are used by QEMU before the guest launches. These are listed in a
+GUIDed table at a known location in the firmware image. QEMU parses
+this table when it loads the firmware image into memory, and then QEMU
+reads individual entries when their values are needed.
+
+Though nothing in the table structure is SEV-specific, currently all the
+entries in the table are related to SEV and SEV-ES features.
+
+
+Table parsing in QEMU
+---------------------
+
+The table is parsed from the footer: first the presence of the table
+footer GUID (96b582de-1fb2-45f7-baea-a366c55a082d) at 0xffffffd0 is
+verified. If that is found, two bytes at 0xffffffce are the entire
+table length.
+
+Then the table is scanned backwards looking for the specific entry GUID.
+
+QEMU files related to parsing and scanning the OVMF table:
+ - ``hw/i386/pc_sysfw_ovmf.c``
+
+The edk2 firmware code that constructs this structure is in the
+`OVMF Reset Vector file`_.
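+
+The backwards scan described above can be sketched as follows. This is an
+illustrative sketch, not the code in ``hw/i386/pc_sysfw_ovmf.c``: the helper
+names ``read_le16`` and ``find_entry`` are hypothetical, the 16-bit lengths
+are assumed to be little-endian, and GUIDs are compared as raw 16-byte arrays
+in whatever encoding the image uses. Each entry is assumed to be laid out as
+data, then a 2-byte entry length, then the entry GUID, as in the memory layout
+table in this document:
+
+.. code-block:: c
+
+    #include <stddef.h>
+    #include <stdint.h>
+    #include <string.h>
+
+    #define GUID_LEN      16
+    #define ENTRY_TRAILER (2 + GUID_LEN)  /* 2-byte length, then GUID */
+
+    /* Read a 16-bit little-endian value (assumed byte order). */
+    static uint16_t read_le16(const uint8_t *p)
+    {
+        return (uint16_t)(p[0] | (p[1] << 8));
+    }
+
+    /*
+     * 'end' points just past the footer GUID; 'avail' is the number of
+     * valid bytes before it. Returns a pointer to the entry's data and
+     * stores its size in *data_len, or returns NULL if the footer or the
+     * entry is not found.
+     */
+    static const uint8_t *find_entry(const uint8_t *end, size_t avail,
+                                     const uint8_t footer_guid[GUID_LEN],
+                                     const uint8_t entry_guid[GUID_LEN],
+                                     size_t *data_len)
+    {
+        if (avail < ENTRY_TRAILER ||
+            memcmp(end - GUID_LEN, footer_guid, GUID_LEN) != 0) {
+            return NULL;                        /* no table footer */
+        }
+
+        size_t total = read_le16(end - ENTRY_TRAILER);
+        if (total < ENTRY_TRAILER || total > avail) {
+            return NULL;                        /* implausible table length */
+        }
+
+        const uint8_t *p = end - ENTRY_TRAILER; /* end of the last entry */
+        size_t remaining = total - ENTRY_TRAILER;
+
+        while (remaining >= ENTRY_TRAILER) {
+            size_t len = read_le16(p - ENTRY_TRAILER); /* whole entry */
+            if (len < ENTRY_TRAILER || len > remaining) {
+                return NULL;                    /* corrupt entry length */
+            }
+            if (memcmp(p - GUID_LEN, entry_guid, GUID_LEN) == 0) {
+                *data_len = len - ENTRY_TRAILER;
+                return p - len;                 /* start of entry data */
+            }
+            p -= len;
+            remaining -= len;
+        }
+        return NULL;
+    }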
+
+
+Table memory layout
+-------------------
+
++------------+--------+-----------------------------------------+
+| GPA | Length | Description |
++============+========+=========================================+
+| 0xffffff80 | 4 | Zero padding |
++------------+--------+-----------------------------------------+
+| 0xffffff84 | 4 | SEV hashes table base address |
++------------+--------+-----------------------------------------+
+| 0xffffff88 | 4 | SEV hashes table size (=0x400) |
++------------+--------+-----------------------------------------+
+| 0xffffff8c | 2 | SEV hashes table entry length (=0x1a) |
++------------+--------+-----------------------------------------+
+| 0xffffff8e | 16 | SEV hashes table GUID: |
+| | | 7255371f-3a3b-4b04-927b-1da6efa8d454 |
++------------+--------+-----------------------------------------+
+| 0xffffff9e | 4 | SEV secret block base address |
++------------+--------+-----------------------------------------+
+| 0xffffffa2 | 4 | SEV secret block size (=0xc00) |
++------------+--------+-----------------------------------------+
+| 0xffffffa6 | 2 | SEV secret block entry length (=0x1a) |
++------------+--------+-----------------------------------------+
+| 0xffffffa8 | 16 | SEV secret block GUID: |
+| | | 4c2eb361-7d9b-4cc3-8081-127c90d3d294 |
++------------+--------+-----------------------------------------+
+| 0xffffffb8 | 4 | SEV-ES AP reset RIP |
++------------+--------+-----------------------------------------+
+| 0xffffffbc | 2 | SEV-ES reset block entry length (=0x16) |
++------------+--------+-----------------------------------------+
+| 0xffffffbe | 16 | SEV-ES reset block entry GUID: |
+| | | 00f771de-1a7e-4fcb-890e-68c77e2fb44e |
++------------+--------+-----------------------------------------+
+| 0xffffffce | 2 | Length of entire table including table |
+| | | footer GUID and length (=0x72) |
++------------+--------+-----------------------------------------+
+| 0xffffffd0 | 16 | OVMF GUIDed table footer GUID: |
+| | | 96b582de-1fb2-45f7-baea-a366c55a082d |
++------------+--------+-----------------------------------------+
+| 0xffffffe0 | 8 | Application processor entry point code |
++------------+--------+-----------------------------------------+
+| 0xffffffe8 | 8 | "\0\0\0\0VTF\0" |
++------------+--------+-----------------------------------------+
+| 0xfffffff0 | 16 | Reset vector code |
++------------+--------+-----------------------------------------+
+
+
+Table entries description
+=========================
+
+SEV-ES reset block
+------------------
+
+Entry GUID: 00f771de-1a7e-4fcb-890e-68c77e2fb44e
+
+For the initial boot of an AP under SEV-ES, the "reset" RIP must be
+programmed to the RAM area defined by this entry. The entry's format
+is:
+
+* IP value [15:0]
+* CS segment base [31:16]
+
+A hypervisor reads the CS segment base and IP value. The CS segment
+base value represents the high order 16-bits of the CS segment base, so
+the hypervisor must left shift the value of the CS segment base by 16
+bits to form the full CS segment base for the CS segment register. It
+would then program the EIP register with the IP value as read.
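+
+The decoding described above can be sketched as follows. The structure and
+function names are hypothetical and chosen only for this illustration; the
+input is the 32-bit value read from the entry:
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    /* Hypothetical container for the decoded AP reset state. */
+    struct ap_reset_state {
+        uint32_t cs_base;  /* full CS segment base */
+        uint16_t ip;       /* value for the instruction pointer */
+    };
+
+    static struct ap_reset_state decode_ap_reset(uint32_t entry)
+    {
+        struct ap_reset_state st;
+
+        /* Bits [15:0] hold the IP value. */
+        st.ip = (uint16_t)(entry & 0xffffu);
+        /* Bits [31:16] hold the high 16 bits of the CS base. */
+        st.cs_base = (entry >> 16) << 16;
+        return st;
+    }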
+
+
+SEV secret block
+----------------
+
+Entry GUID: 4c2eb361-7d9b-4cc3-8081-127c90d3d294
+
+This describes the guest RAM area where the hypervisor should inject the
+Guest Owner secret (using SEV_LAUNCH_SECRET).
+
+
+SEV hashes table
+----------------
+
+Entry GUID: 7255371f-3a3b-4b04-927b-1da6efa8d454
+
+This describes the guest RAM area where the hypervisor should install a
+table describing the hashes of certain firmware configuration device
+files that would otherwise be passed in unchecked. The current use is
+for the kernel, initrd and command line values, but others may be added.
+
+
+.. _OVMF Reset Vector file:
+ https://github.com/tianocore/edk2/blob/master/OvmfPkg/ResetVector/Ia16/ResetVectorVtf0.asm
diff --git a/docs/specs/standard-vga.rst b/docs/specs/standard-vga.rst
new file mode 100644
index 0000000000..992f429ced
--- /dev/null
+++ b/docs/specs/standard-vga.rst
@@ -0,0 +1,94 @@
+
+QEMU Standard VGA
+=================
+
+Exists in two variants, for isa and pci.
+
+command line switches:
+
+``-vga std``
+ picks isa for -M isapc, otherwise pci
+``-device VGA``
+ pci variant
+``-device isa-vga``
+ isa variant
+``-device secondary-vga``
+ legacy-free pci variant
+
+
+PCI spec
+--------
+
+Applies to the pci variant only for obvious reasons.
+
+PCI ID
+ ``1234:1111``
+
+PCI Region 0
+ Framebuffer memory, 16 MB in size (by default).
+ Size is tunable via vga_mem_mb property.
+
+PCI Region 1
+ Reserved (so we have the option to make the framebuffer bar 64bit).
+
+PCI Region 2
+ MMIO bar, 4096 bytes in size (QEMU 1.3+)
+
+PCI ROM Region
+ Holds the vgabios (QEMU 0.14+).
+
+
+The legacy-free variant has no ROM and has ``PCI_CLASS_DISPLAY_OTHER``
+instead of ``PCI_CLASS_DISPLAY_VGA``.
+
+
+IO ports used
+-------------
+
+Doesn't apply to the legacy-free pci variant; use the MMIO bar instead.
+
+``03c0 - 03df``
+ standard vga ports
+``01ce``
+ bochs vbe interface index port
+``01cf``
+ bochs vbe interface data port (x86 only)
+``01d0``
+ bochs vbe interface data port
+
+
+Memory regions used
+-------------------
+
+``0xe0000000``
+ Framebuffer memory, isa variant only.
+
+The pci variant used to mirror the framebuffer bar here; QEMU 0.14+
+no longer does that (except when in ``-M pc-$old`` compat mode).
+
+
+MMIO area spec
+--------------
+
+Likewise applies to the pci variant only for obvious reasons.
+
+``0000 - 03ff``
+ edid data blob.
+``0400 - 041f``
+ vga ioports (``0x3c0`` to ``0x3df``), remapped 1:1. Word access
+ is supported, bytes are written in little endian order (aka index
+ port first), so indexed registers can be updated with a single
+ mmio write (and thus only one vmexit).
+``0500 - 0515``
+ bochs dispi interface registers, mapped flat without index/data ports.
+ Use ``(index << 1)`` as offset for (16bit) register access.
+``0600 - 0607``
+ QEMU extended registers. QEMU 2.2+ only.
+ The pci revision is 2 (or greater) when these registers are present.
+ The registers are 32bit.
+``0600``
+ QEMU extended register region size, in bytes.
+``0604``
+ framebuffer endianness register.
+ - ``0xbebebebe`` indicates big endian.
+ - ``0x1e1e1e1e`` indicates little endian.
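+
+As a hedged illustration of the layout above, the sketch below computes the
+MMIO offset of a flat-mapped bochs dispi register and interprets the
+framebuffer endianness register. The macro and function names are
+hypothetical and exist only for this example; they are not taken from the
+QEMU sources:
+
+.. code-block:: c
+
+    #include <stdint.h>
+
+    #define SKETCH_DISPI_BASE   0x0500u      /* first dispi register */
+    #define SKETCH_DISPI_LAST   0x0515u      /* last byte of the range */
+    #define SKETCH_FB_BIG       0xbebebebeu
+    #define SKETCH_FB_LITTLE    0x1e1e1e1eu
+
+    /*
+     * MMIO offset of 16-bit dispi register 'index', using (index << 1),
+     * or -1 if the register would fall outside 0500 - 0515.
+     */
+    static int dispi_mmio_offset(unsigned int index)
+    {
+        if (index > (SKETCH_DISPI_LAST - SKETCH_DISPI_BASE) / 2) {
+            return -1;
+        }
+        return (int)(SKETCH_DISPI_BASE + (index << 1));
+    }
+
+    /* 1 for big endian, 0 for little endian, -1 for any other value. */
+    static int fb_is_big_endian(uint32_t reg_0604)
+    {
+        if (reg_0604 == SKETCH_FB_BIG) {
+            return 1;
+        }
+        if (reg_0604 == SKETCH_FB_LITTLE) {
+            return 0;
+        }
+        return -1;
+    }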
diff --git a/docs/specs/standard-vga.txt b/docs/specs/standard-vga.txt
deleted file mode 100644
index 18f75f1b30..0000000000
--- a/docs/specs/standard-vga.txt
+++ /dev/null
@@ -1,81 +0,0 @@
-
-QEMU Standard VGA
-=================
-
-Exists in two variants, for isa and pci.
-
-command line switches:
- -vga std [ picks isa for -M isapc, otherwise pci ]
- -device VGA [ pci variant ]
- -device isa-vga [ isa variant ]
- -device secondary-vga [ legacy-free pci variant ]
-
-
-PCI spec
---------
-
-Applies to the pci variant only for obvious reasons.
-
-PCI ID: 1234:1111
-
-PCI Region 0:
- Framebuffer memory, 16 MB in size (by default).
- Size is tunable via vga_mem_mb property.
-
-PCI Region 1:
- Reserved (so we have the option to make the framebuffer bar 64bit).
-
-PCI Region 2:
- MMIO bar, 4096 bytes in size (qemu 1.3+)
-
-PCI ROM Region:
- Holds the vgabios (qemu 0.14+).
-
-
-The legacy-free variant has no ROM and has PCI_CLASS_DISPLAY_OTHER
-instead of PCI_CLASS_DISPLAY_VGA.
-
-
-IO ports used
--------------
-
-Doesn't apply to the legacy-free pci variant, use the MMIO bar instead.
-
-03c0 - 03df : standard vga ports
-01ce : bochs vbe interface index port
-01cf : bochs vbe interface data port (x86 only)
-01d0 : bochs vbe interface data port
-
-
-Memory regions used
--------------------
-
-0xe0000000 : Framebuffer memory, isa variant only.
-
-The pci variant used to mirror the framebuffer bar here, qemu 0.14+
-stops doing that (except when in -M pc-$old compat mode).
-
-
-MMIO area spec
---------------
-
-Likewise applies to the pci variant only for obvious reasons.
-
-0000 - 03ff : edid data blob.
-0400 - 041f : vga ioports (0x3c0 -> 0x3df), remapped 1:1.
- word access is supported, bytes are written
- in little endia order (aka index port first),
- so indexed registers can be updated with a
- single mmio write (and thus only one vmexit).
-0500 - 0515 : bochs dispi interface registers, mapped flat
- without index/data ports. Use (index << 1)
- as offset for (16bit) register access.
-
-0600 - 0607 : qemu extended registers. qemu 2.2+ only.
- The pci revision is 2 (or greater) when
- these registers are present. The registers
- are 32bit.
- 0600 : qemu extended register region size, in bytes.
- 0604 : framebuffer endianness register.
- - 0xbebebebe indicates big endian.
- - 0x1e1e1e1e indicates little endian.
diff --git a/docs/specs/tpm.rst b/docs/specs/tpm.rst
index 3be190343a..68cb8cf7e6 100644
--- a/docs/specs/tpm.rst
+++ b/docs/specs/tpm.rst
@@ -1,3 +1,5 @@
+.. _tpm-device:
+
===============
QEMU TPM Device
===============
@@ -21,12 +23,16 @@ QEMU files related to TPM TIS interface:
- ``hw/tpm/tpm_tis_common.c``
- ``hw/tpm/tpm_tis_isa.c``
- ``hw/tpm/tpm_tis_sysbus.c``
+ - ``hw/tpm/tpm_tis_i2c.c``
- ``hw/tpm/tpm_tis.h``
Both an ISA device and a sysbus device are available. The former is
used with pc/q35 machine while the latter can be instantiated in the
Arm virt machine.
+An I2C device is also provided, which can be instantiated in Arm-based
+emulated machines. This device supports only the TPM 2 protocol.
+
CRB interface
-------------
@@ -250,24 +256,25 @@ hardware TPM ``/dev/tpm0``:
The following commands should result in similar output inside the VM
with a Linux kernel that either has the TPM TIS driver built-in or
-available as a module:
+available as a module (assuming a TPM 2 is passed through):
.. code-block:: console
# dmesg | grep -i tpm
- [ 0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1)
-
- # dmesg | grep TCPA
- [ 0.000000] ACPI: TCPA 0x0000000003FFD191C 000032 (v02 BOCHS \
- BXPCTCPA 0000001 BXPC 00000001)
+ [ 0.012560] ACPI: TPM2 0x000000000BFFD1900 00004C (v04 BOCHS \
+ BXPC 0000001 BXPC 00000001)
# ls -l /dev/tpm*
- crw-------. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0
+ crw-rw----. 1 tss root 10, 224 Sep 6 12:36 /dev/tpm0
+ crw-rw----. 1 tss rss 253, 65536 Sep 6 12:36 /dev/tpmrm0
- # find /sys/devices/ | grep pcrs$ | xargs cat
- PCR-00: 35 4E 3B CE 23 9F 38 59 ...
+ Starting with Linux 5.12 there are PCR entries for TPM 2 in sysfs:
+ # find /sys/devices/ -type f | grep pcr-sha
+ ...
+ /sys/devices/LNXSYSTEM:00/LNXSYBUS:00/MSFT0101:00/tpm/tpm0/pcr-sha256/1
+ ...
+ /sys/devices/LNXSYSTEM:00/LNXSYBUS:00/MSFT0101:00/tpm/tpm0/pcr-sha256/9
...
- PCR-23: 00 00 00 00 00 00 00 00 ...
The QEMU TPM emulator device
----------------------------
@@ -304,6 +311,7 @@ a socket interface. They do not need to be run as root.
mkdir /tmp/mytpm1
swtpm socket --tpmstate dir=/tmp/mytpm1 \
--ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
+ --tpm2 \
--log level=20
Command line to start QEMU with the TPM emulator device communicating
@@ -335,9 +343,9 @@ In case an Arm virt machine is emulated, use the following command line:
.. code-block:: console
- qemu-system-aarch64 -machine virt,gic-version=3,accel=kvm \
+ qemu-system-aarch64 -machine virt,gic-version=3,acpi=off \
-cpu host -m 4G \
- -nographic -no-acpi \
+ -nographic -accel kvm \
-chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
-tpmdev emulator,id=tpm0,chardev=chrtpm \
-device tpm-tis-device,tpmdev=tpm0 \
@@ -346,6 +354,23 @@ In case an Arm virt machine is emulated, use the following command line:
-drive if=pflash,format=raw,file=flash0.img,readonly=on \
-drive if=pflash,format=raw,file=flash1.img
+In case an ast2600-evb BMC machine is emulated and you want to use a TPM device
+attached to an I2C bus, use the following command line:
+
+.. code-block:: console
+
+ qemu-system-arm -M ast2600-evb -nographic \
+ -kernel arch/arm/boot/zImage \
+ -dtb arch/arm/boot/dts/aspeed-ast2600-evb.dtb \
+ -initrd rootfs.cpio \
+ -chardev socket,id=chrtpm,path=/tmp/mytpm1/swtpm-sock \
+ -tpmdev emulator,id=tpm0,chardev=chrtpm \
+ -device tpm-tis-i2c,tpmdev=tpm0,bus=aspeed.i2c.bus.12,address=0x2e
+
+ For testing, use this command to load the driver to the correct address
+
+ echo tpm_tis_i2c 0x2e > /sys/bus/i2c/devices/i2c-12/new_device
+
In case SeaBIOS is used as firmware, it should show the TPM menu item
after entering the menu with 'ESC'.
@@ -365,19 +390,20 @@ available as a module:
.. code-block:: console
# dmesg | grep -i tpm
- [ 0.711310] tpm_tis 00:06: 1.2 TPM (device=id 0x1, rev-id 1)
-
- # dmesg | grep TCPA
- [ 0.000000] ACPI: TCPA 0x0000000003FFD191C 000032 (v02 BOCHS \
- BXPCTCPA 0000001 BXPC 00000001)
+ [ 0.012560] ACPI: TPM2 0x000000000BFFD1900 00004C (v04 BOCHS \
+ BXPC 0000001 BXPC 00000001)
# ls -l /dev/tpm*
- crw-------. 1 root root 10, 224 Jul 11 10:11 /dev/tpm0
+ crw-rw----. 1 tss root 10, 224 Sep 6 12:36 /dev/tpm0
+ crw-rw----. 1 tss rss 253, 65536 Sep 6 12:36 /dev/tpmrm0
- # find /sys/devices/ | grep pcrs$ | xargs cat
- PCR-00: 35 4E 3B CE 23 9F 38 59 ...
+ Starting with Linux 5.12 there are PCR entries for TPM 2 in sysfs:
+ # find /sys/devices/ -type f | grep pcr-sha
+ ...
+ /sys/devices/LNXSYSTEM:00/LNXSYBUS:00/MSFT0101:00/tpm/tpm0/pcr-sha256/1
+ ...
+ /sys/devices/LNXSYSTEM:00/LNXSYBUS:00/MSFT0101:00/tpm/tpm0/pcr-sha256/9
...
- PCR-23: 00 00 00 00 00 00 00 00 ...
Migration with the TPM emulator
===============================
@@ -398,7 +424,8 @@ In a 1st terminal start an instance of a swtpm using the following command:
mkdir /tmp/mytpm1
swtpm socket --tpmstate dir=/tmp/mytpm1 \
--ctrl type=unixio,path=/tmp/mytpm1/swtpm-sock \
- --log level=20 --tpm2
+ --tpm2 \
+ --log level=20
In a 2nd terminal start the VM:
diff --git a/docs/specs/virt-ctlr.txt b/docs/specs/virt-ctlr.rst
index 24d38084f7..ad3edde82d 100644
--- a/docs/specs/virt-ctlr.txt
+++ b/docs/specs/virt-ctlr.rst
@@ -1,9 +1,9 @@
Virtual System Controller
=========================
-This device is a simple interface defined for the pure virtual machine with no
-hardware reference implementation to allow the guest kernel to send command
-to the host hypervisor.
+The ``virt-ctrl`` device is a simple interface defined for the pure
+virtual machine with no hardware reference implementation to allow the
+guest kernel to send commands to the host hypervisor.
The specification can evolve, the current state is defined as below.
@@ -11,14 +11,12 @@ This is a MMIO mapped device using 256 bytes.
Two 32bit registers are defined:
-1- the features register (read-only, address 0x00)
-
+the features register (read-only, address 0x00)
This register allows the device to report features supported by the
controller.
The only feature supported for the moment is power control (0x01).
-2- the command register (write-only, address 0x04)
-
+the command register (write-only, address 0x04)
This register allows the kernel to send the commands to the hypervisor.
The implemented commands are part of the power control feature and
are reset (1), halt (2) and panic (3).
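
As an illustration of the register layout described in virt-ctlr.rst above, the following is a minimal Python sketch of how a device model could decode guest accesses. The class and method names (`VirtCtrlModel`, `read`, `write`) are hypothetical and are not QEMU's implementation. Only the two offsets, the power control feature bit and the three command values come from the text. The behaviour for other offsets and unknown commands is an assumption.

```python
# Toy model of the virt-ctrl register interface; all names are hypothetical.
REG_FEATURES = 0x00     # read-only features register
REG_CMD = 0x04          # write-only command register
FEAT_POWER_CTRL = 0x01  # the only feature listed in the text

COMMANDS = {1: "reset", 2: "halt", 3: "panic"}


class VirtCtrlModel:
    def __init__(self):
        self.actions = []  # power-control actions requested by the guest

    def read(self, offset):
        # Only the features register is readable. Other offsets read as 0
        # in this sketch (an assumption; the text does not specify this).
        if offset == REG_FEATURES:
            return FEAT_POWER_CTRL
        return 0

    def write(self, offset, value):
        # Only the command register is writable. Unknown commands and writes
        # to other offsets are silently ignored here (also an assumption).
        if offset == REG_CMD and value in COMMANDS:
            self.actions.append(COMMANDS[value])


dev = VirtCtrlModel()
# A guest would check the feature bit before issuing a power command.
if dev.read(REG_FEATURES) & FEAT_POWER_CTRL:
    dev.write(REG_CMD, 1)
print(dev.actions)
```

Checking the features register first mirrors the statement that the specification can evolve: a guest should not assume power control is present.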
diff --git a/docs/specs/vmcoreinfo.rst b/docs/specs/vmcoreinfo.rst
new file mode 100644
index 0000000000..6541aa116f
--- /dev/null
+++ b/docs/specs/vmcoreinfo.rst
@@ -0,0 +1,54 @@
+=================
+VMCoreInfo device
+=================
+
+The ``-device vmcoreinfo`` will create a ``fw_cfg`` entry for a guest to
+store dump details.
+
+``etc/vmcoreinfo``
+==================
+
+A guest may use this ``fw_cfg`` entry to add information details to QEMU
+dumps.
+
+The entry of 16 bytes has the following layout, in little-endian::
+
+ #define VMCOREINFO_FORMAT_NONE 0x0
+ #define VMCOREINFO_FORMAT_ELF 0x1
+
+ struct FWCfgVMCoreInfo {
+ uint16_t host_format; /* formats host supports */
+ uint16_t guest_format; /* format guest supplies */
+ uint32_t size; /* size of vmcoreinfo region */
+ uint64_t paddr; /* physical address of vmcoreinfo region */
+ };
+
+Only full writes (of 16 bytes) are considered valid for further
+processing of entry values.
+
+A write of 0 in ``guest_format`` will disable further processing of
+vmcoreinfo entry values & content.
+
+You may write a ``guest_format`` that is not supported by the host, in
+which case the entry data can be ignored by QEMU (but you may still
+access it through a debugger, via ``vmcoreinfo_realize::vmcoreinfo_state``).
+
+Format & content
+================
+
+As of QEMU 2.11, only ``VMCOREINFO_FORMAT_ELF`` is supported.
+
+The entry gives location and size of an ELF note that is appended in
+qemu dumps.
+
+The note format/class must be of the target bitness and the size must
+be less than 1 MB.
+
+If the ELF note name is ``VMCOREINFO``, it is expected to be the Linux
+vmcoreinfo note (see `the kernel documentation for its format
+<https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-kernel-vmcoreinfo>`_).
+In this case, qemu dump code will read the content
+as a key=value text file, looking for ``NUMBER(phys_base)`` key
+value. The value is expected to be more accurate than the architecture's
+guess of the value. This is useful for KASLR-enabled guests with
+ancient tools not handling the ``VMCOREINFO`` note.
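
The 16-byte ``etc/vmcoreinfo`` entry described in vmcoreinfo.rst above can be sketched in Python with the standard `struct` module. This is only an illustration of the little-endian layout. The helper names (`pack_vmcoreinfo`, `unpack_vmcoreinfo`) and the example address and size are hypothetical. The text does not say what a guest should write into `host_format`, so the sketch defaults it to 0 purely for illustration.

```python
import struct

VMCOREINFO_FORMAT_NONE = 0x0
VMCOREINFO_FORMAT_ELF = 0x1

# Little-endian layout of struct FWCfgVMCoreInfo:
# uint16 host_format, uint16 guest_format, uint32 size, uint64 paddr.
FW_CFG_VMCOREINFO = struct.Struct("<HHIQ")


def pack_vmcoreinfo(guest_format, size, paddr, host_format=0):
    """Build the full 16-byte entry; only full writes are considered valid."""
    return FW_CFG_VMCOREINFO.pack(host_format, guest_format, size, paddr)


def unpack_vmcoreinfo(blob):
    """Decode a 16-byte entry into a dict, rejecting partial writes."""
    if len(blob) != FW_CFG_VMCOREINFO.size:
        raise ValueError("expected a full 16-byte write")
    host_format, guest_format, size, paddr = FW_CFG_VMCOREINFO.unpack(blob)
    return {
        "host_format": host_format,
        "guest_format": guest_format,
        "size": size,
        "paddr": paddr,
    }


# Hypothetical example: a 4 KiB ELF note placed at guest physical 0x3F000000.
entry = pack_vmcoreinfo(VMCOREINFO_FORMAT_ELF, 0x1000, 0x3F000000)
print(len(entry), unpack_vmcoreinfo(entry))
```

Writing an entry with `guest_format` set to `VMCOREINFO_FORMAT_NONE` corresponds to the "write of 0" case that disables further processing.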
diff --git a/docs/specs/vmcoreinfo.txt b/docs/specs/vmcoreinfo.txt
deleted file mode 100644
index bcbca6fe47..0000000000
--- a/docs/specs/vmcoreinfo.txt
+++ /dev/null
@@ -1,53 +0,0 @@
-=================
-VMCoreInfo device
-=================
-
-The `-device vmcoreinfo` will create a fw_cfg entry for a guest to
-store dump details.
-
-etc/vmcoreinfo
-**************
-
-A guest may use this fw_cfg entry to add information details to qemu
-dumps.
-
-The entry of 16 bytes has the following layout, in little-endian::
-
-#define VMCOREINFO_FORMAT_NONE 0x0
-#define VMCOREINFO_FORMAT_ELF 0x1
-
- struct FWCfgVMCoreInfo {
- uint16_t host_format; /* formats host supports */
- uint16_t guest_format; /* format guest supplies */
- uint32_t size; /* size of vmcoreinfo region */
- uint64_t paddr; /* physical address of vmcoreinfo region */
- };
-
-Only full write (of 16 bytes) are considered valid for further
-processing of entry values.
-
-A write of 0 in guest_format will disable further processing of
-vmcoreinfo entry values & content.
-
-You may write a guest_format that is not supported by the host, in
-which case the entry data can be ignored by qemu (but you may still
-access it through a debugger, via vmcoreinfo_realize::vmcoreinfo_state).
-
-Format & content
-****************
-
-As of qemu 2.11, only VMCOREINFO_FORMAT_ELF is supported.
-
-The entry gives location and size of an ELF note that is appended in
-qemu dumps.
-
-The note format/class must be of the target bitness and the size must
-be less than 1Mb.
-
-If the ELF note name is "VMCOREINFO", it is expected to be the Linux
-vmcoreinfo note (see Documentation/ABI/testing/sysfs-kernel-vmcoreinfo
-in Linux source). In this case, qemu dump code will read the content
-as a key=value text file, looking for "NUMBER(phys_base)" key
-value. The value is expected to be more accurate than architecture
-guess of the value. This is useful for KASLR-enabled guest with
-ancient tools not handling the VMCOREINFO note.
diff --git a/docs/specs/vmgenid.rst b/docs/specs/vmgenid.rst
new file mode 100644
index 0000000000..9a3cefcd82
--- /dev/null
+++ b/docs/specs/vmgenid.rst
@@ -0,0 +1,246 @@
+Virtual Machine Generation ID Device
+====================================
+
+..
+ Copyright (C) 2016 Red Hat, Inc.
+ Copyright (C) 2017 Skyport Systems, Inc.
+
+ This work is licensed under the terms of the GNU GPL, version 2 or later.
+ See the COPYING file in the top-level directory.
+
+The VM generation ID (``vmgenid``) device is an emulated device which
+exposes a 128-bit, cryptographically random, integer value identifier,
+referred to as a Globally Unique Identifier, or GUID.
+
+This allows management applications (e.g. libvirt) to notify the guest
+operating system when the virtual machine is executed with a different
+configuration (e.g. snapshot execution or creation from a template). The
+guest operating system notices the change, and is then able to react as
+appropriate by marking its copies of distributed databases as dirty,
+re-initializing its random number generator etc.
+
+
+Requirements
+------------
+
+These requirements are extracted from the "How to implement virtual machine
+generation ID support in a virtualization platform" section of
+`the Microsoft Virtual Machine Generation ID specification
+<http://go.microsoft.com/fwlink/?LinkId=260709>`_ dated August 1, 2012.
+
+- **R1a** The generation ID shall live in an 8-byte aligned buffer.
+
+- **R1b** The buffer holding the generation ID shall be in guest RAM,
+ ROM, or device MMIO range.
+
+- **R1c** The buffer holding the generation ID shall be kept separate from
+ areas used by the operating system.
+
+- **R1d** The buffer shall not be covered by an AddressRangeMemory or
+ AddressRangeACPI entry in the E820 or UEFI memory map.
+
+- **R1e** The generation ID shall not live in a page frame that could be
+ mapped with caching disabled. (In other words, regardless of whether the
+ generation ID lives in RAM, ROM or MMIO, it shall only be mapped as
+ cacheable.)
+
+- **R2** to **R5** [These AML requirements are isolated well enough in the
+ Microsoft specification for us to simply refer to them here.]
+
+- **R6** The hypervisor shall expose a _HID (hardware identifier) object
+ in the VMGenId device's scope that is unique to the hypervisor vendor.
+
+
+QEMU Implementation
+-------------------
+
+The above-mentioned specification does not dictate which ACPI descriptor table
+will contain the VM Generation ID device. Other implementations (Hyper-V and
+Xen) put it in the main descriptor table (Differentiated System Description
+Table or DSDT). For ease of debugging and implementation, we have decided to
+put it in its own Secondary System Description Table, or SSDT.
+
+The following is a dump of the contents from a running system::
+
+ # iasl -p ./SSDT -d /sys/firmware/acpi/tables/SSDT
+
+ Intel ACPI Component Architecture
+ ASL+ Optimizing Compiler version 20150717-64
+ Copyright (c) 2000 - 2015 Intel Corporation
+
+ Reading ACPI table from file /sys/firmware/acpi/tables/SSDT - Length
+ 00000198 (0x0000C6)
+ ACPI: SSDT 0x0000000000000000 0000C6 (v01 BOCHS VMGENID 00000001 BXPC 00000001)
+ Acpi table [SSDT] successfully installed and loaded
+ Pass 1 parse of [SSDT]
+ Pass 2 parse of [SSDT]
+ Parsing Deferred Opcodes (Methods/Buffers/Packages/Regions)
+
+ Parsing completed
+ Disassembly completed
+ ASL Output: ./SSDT.dsl - 1631 bytes
+ # cat SSDT.dsl
+ /*
+ * Intel ACPI Component Architecture
+ * AML/ASL+ Disassembler version 20150717-64
+ * Copyright (c) 2000 - 2015 Intel Corporation
+ *
+ * Disassembling to symbolic ASL+ operators
+ *
+ * Disassembly of /sys/firmware/acpi/tables/SSDT, Sun Feb 5 00:19:37 2017
+ *
+ * Original Table Header:
+ * Signature "SSDT"
+ * Length 0x000000CA (202)
+ * Revision 0x01
+ * Checksum 0x4B
+ * OEM ID "BOCHS "
+ * OEM Table ID "VMGENID"
+ * OEM Revision 0x00000001 (1)
+ * Compiler ID "BXPC"
+ * Compiler Version 0x00000001 (1)
+ */
+ DefinitionBlock ("/sys/firmware/acpi/tables/SSDT.aml", "SSDT", 1, "BOCHS ", "VMGENID", 0x00000001)
+ {
+ Name (VGIA, 0x07FFF000)
+ Scope (\_SB)
+ {
+ Device (VGEN)
+ {
+ Name (_HID, "QEMUVGID") // _HID: Hardware ID
+ Name (_CID, "VM_Gen_Counter") // _CID: Compatible ID
+ Name (_DDN, "VM_Gen_Counter") // _DDN: DOS Device Name
+ Method (_STA, 0, NotSerialized) // _STA: Status
+ {
+ Local0 = 0x0F
+ If ((VGIA == Zero))
+ {
+ Local0 = Zero
+ }
+
+ Return (Local0)
+ }
+
+ Method (ADDR, 0, NotSerialized)
+ {
+ Local0 = Package (0x02) {}
+ Index (Local0, Zero) = (VGIA + 0x28)
+ Index (Local0, One) = Zero
+ Return (Local0)
+ }
+ }
+ }
+
+ Method (\_GPE._E05, 0, NotSerialized) // _Exx: Edge-Triggered GPE
+ {
+ Notify (\_SB.VGEN, 0x80) // Status Change
+ }
+ }
+
+
+Design Details:
+---------------
+
+Requirements R1a through R1e dictate that the memory holding the
+VM Generation ID must be allocated and owned by the guest firmware,
+in this case BIOS or UEFI. However, to be useful, QEMU must be able to
+change the contents of the memory at runtime, specifically when starting a
+backed-up or snapshotted image. In order to do this, QEMU must know the
+address that has been allocated.
+
+The mechanism chosen for this memory sharing is writable fw_cfg blobs.
+These are data objects that are visible to both QEMU and guests, and are
+addressable as sequential files.
+
+More information about fw_cfg can be found in :doc:`fw_cfg`.
+
+Two fw_cfg blobs are used in this case:
+
+``/etc/vmgenid_guid``
+
+- contains the actual VM Generation ID GUID
+- read-only to the guest
+
+``/etc/vmgenid_addr``
+
+- contains the address of the downloaded vmgenid blob
+- writable by the guest
+
+
+QEMU sends the following commands to the guest at startup:
+
+1. Allocate memory for vmgenid_guid fw_cfg blob.
+2. Write the address of vmgenid_guid into the SSDT (VGIA ACPI variable as
+ shown above in the iasl dump). Note that this change is not propagated
+ back to QEMU.
+3. Write the address of vmgenid_guid back to QEMU's copy of vmgenid_addr
+ via the fw_cfg DMA interface.
+
+After step 3, QEMU is able to update the contents of vmgenid_guid at will.
+
+Since BIOS or UEFI does not necessarily run when we wish to change the GUID,
+the value of VGIA is persisted via the VMState mechanism.
+
+As spelled out in the specification, any change to the GUID executes an
+ACPI notification. The exact handler to use is not specified, so the vmgenid
+device uses the first unused one: ``\_GPE._E05``.
+
+
+Endian-ness Considerations:
+---------------------------
+
+Although not specified in Microsoft's document, it is assumed that the
+device is expected to use little-endian format.
+
+All GUIDs passed in via the command line or monitor are treated as big-endian.
+GUID values displayed via monitor are shown in big-endian format.
+
+
+GUID Storage Format:
+--------------------
+
+In order to implement an OVMF "SDT Header Probe Suppressor", the contents of
+the vmgenid_guid fw_cfg blob are not simply a 128-bit GUID. There is also
+significant padding in order to align and fill a memory page, as shown in the
+following diagram::
+
+ +----------------------------------+
+ | SSDT with OEM Table ID = VMGENID |
+ +----------------------------------+
+ | ... | TOP OF PAGE
+ | VGIA dword object ---------------|-----> +---------------------------+
+ | ... | | fw-allocated array for |
+ | _STA method referring to VGIA | | "etc/vmgenid_guid" |
+ | ... | +---------------------------+
+ | ADDR method referring to VGIA | | 0: OVMF SDT Header probe |
+ | ... | | suppressor |
+ +----------------------------------+ | 36: padding for 8-byte |
+ | alignment |
+ | 40: GUID |
+ | 56: padding to page size |
+ +---------------------------+
+ END OF PAGE
+
+
+Device Usage:
+-------------
+
+The device has one property, which may only be set using the command line:
+
+``guid``
+ sets the value of the GUID. A special value ``auto`` instructs
+ QEMU to generate a new random GUID.
+
+For example::
+
+ QEMU -device vmgenid,guid="324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"
+ QEMU -device vmgenid,guid=auto
+
+The property may be queried via QMP/HMP::
+
+ (QEMU) query-vm-generation-id
+ {"return": {"guid": "324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"}}
+
+Setting of this parameter is intentionally left out from the QMP/HMP
+interfaces. There are no known use cases for changing the GUID once QEMU is
+running, and adding this capability would greatly increase the complexity.
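
The GUID storage layout shown in the vmgenid.rst diagram above can be sketched in Python as follows. This is a simplified illustration, not QEMU's code. It assumes a 4096-byte page, leaves the probe suppressor and padding as zero bytes, and assumes that "little-endian" means the Microsoft GUID byte order, which Python exposes as `uuid.UUID.bytes_le`. The function names are hypothetical. The offset of 40 comes from the diagram and matches the `VGIA + 0x28` computed by the `ADDR` method in the iasl dump.

```python
import uuid

PAGE_SIZE = 4096   # assumption: the diagram only says "page size"
GUID_OFFSET = 40   # from the diagram; equals 0x28 in the ADDR method
GUID_LEN = 16      # a 128-bit GUID


def build_vmgenid_page(guid_str):
    """Return a page-sized blob with the GUID stored at offset 40."""
    guid = uuid.UUID(guid_str)  # textual form, as given on the command line
    page = bytearray(PAGE_SIZE)  # bytes 0..39 and 56..end stay zero here
    page[GUID_OFFSET:GUID_OFFSET + GUID_LEN] = guid.bytes_le
    return bytes(page)


def read_guid(page):
    """Recover the textual GUID from a blob built by build_vmgenid_page."""
    raw = page[GUID_OFFSET:GUID_OFFSET + GUID_LEN]
    return str(uuid.UUID(bytes_le=raw))


page = build_vmgenid_page("324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87")
print(read_guid(page))
```

Offset 40 is a multiple of 8, so a GUID placed there in a page-aligned buffer satisfies requirement R1a.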
diff --git a/docs/specs/vmgenid.txt b/docs/specs/vmgenid.txt
deleted file mode 100644
index aa9f518676..0000000000
--- a/docs/specs/vmgenid.txt
+++ /dev/null
@@ -1,245 +0,0 @@
-VIRTUAL MACHINE GENERATION ID
-=============================
-
-Copyright (C) 2016 Red Hat, Inc.
-Copyright (C) 2017 Skyport Systems, Inc.
-
-This work is licensed under the terms of the GNU GPL, version 2 or later.
-See the COPYING file in the top-level directory.
-
-===
-
-The VM generation ID (vmgenid) device is an emulated device which
-exposes a 128-bit, cryptographically random, integer value identifier,
-referred to as a Globally Unique Identifier, or GUID.
-
-This allows management applications (e.g. libvirt) to notify the guest
-operating system when the virtual machine is executed with a different
-configuration (e.g. snapshot execution or creation from a template). The
-guest operating system notices the change, and is then able to react as
-appropriate by marking its copies of distributed databases as dirty,
-re-initializing its random number generator etc.
-
-
-Requirements
-------------
-
-These requirements are extracted from the "How to implement virtual machine
-generation ID support in a virtualization platform" section of the
-specification, dated August 1, 2012.
-
-
-The document may be found on the web at:
- http://go.microsoft.com/fwlink/?LinkId=260709
-
-R1a. The generation ID shall live in an 8-byte aligned buffer.
-
-R1b. The buffer holding the generation ID shall be in guest RAM, ROM, or device
- MMIO range.
-
-R1c. The buffer holding the generation ID shall be kept separate from areas
- used by the operating system.
-
-R1d. The buffer shall not be covered by an AddressRangeMemory or
- AddressRangeACPI entry in the E820 or UEFI memory map.
-
-R1e. The generation ID shall not live in a page frame that could be mapped with
- caching disabled. (In other words, regardless of whether the generation ID
- lives in RAM, ROM or MMIO, it shall only be mapped as cacheable.)
-
-R2 to R5. [These AML requirements are isolated well enough in the Microsoft
- specification for us to simply refer to them here.]
-
-R6. The hypervisor shall expose a _HID (hardware identifier) object in the
- VMGenId device's scope that is unique to the hypervisor vendor.
-
-
-QEMU Implementation
--------------------
-
-The above-mentioned specification does not dictate which ACPI descriptor table
-will contain the VM Generation ID device. Other implementations (Hyper-V and
-Xen) put it in the main descriptor table (Differentiated System Description
-Table or DSDT). For ease of debugging and implementation, we have decided to
-put it in its own Secondary System Description Table, or SSDT.
-
-The following is a dump of the contents from a running system:
-
-# iasl -p ./SSDT -d /sys/firmware/acpi/tables/SSDT
-
-Intel ACPI Component Architecture
-ASL+ Optimizing Compiler version 20150717-64
-Copyright (c) 2000 - 2015 Intel Corporation
-
-Reading ACPI table from file /sys/firmware/acpi/tables/SSDT - Length
-00000198 (0x0000C6)
-ACPI: SSDT 0x0000000000000000 0000C6 (v01 BOCHS VMGENID 00000001 BXPC
-00000001)
-Acpi table [SSDT] successfully installed and loaded
-Pass 1 parse of [SSDT]
-Pass 2 parse of [SSDT]
-Parsing Deferred Opcodes (Methods/Buffers/Packages/Regions)
-
-Parsing completed
-Disassembly completed
-ASL Output: ./SSDT.dsl - 1631 bytes
-# cat SSDT.dsl
-/*
- * Intel ACPI Component Architecture
- * AML/ASL+ Disassembler version 20150717-64
- * Copyright (c) 2000 - 2015 Intel Corporation
- *
- * Disassembling to symbolic ASL+ operators
- *
- * Disassembly of /sys/firmware/acpi/tables/SSDT, Sun Feb 5 00:19:37 2017
- *
- * Original Table Header:
- * Signature "SSDT"
- * Length 0x000000CA (202)
- * Revision 0x01
- * Checksum 0x4B
- * OEM ID "BOCHS "
- * OEM Table ID "VMGENID"
- * OEM Revision 0x00000001 (1)
- * Compiler ID "BXPC"
- * Compiler Version 0x00000001 (1)
- */
-DefinitionBlock ("/sys/firmware/acpi/tables/SSDT.aml", "SSDT", 1, "BOCHS ",
-"VMGENID", 0x00000001)
-{
- Name (VGIA, 0x07FFF000)
- Scope (\_SB)
- {
- Device (VGEN)
- {
- Name (_HID, "QEMUVGID") // _HID: Hardware ID
- Name (_CID, "VM_Gen_Counter") // _CID: Compatible ID
- Name (_DDN, "VM_Gen_Counter") // _DDN: DOS Device Name
- Method (_STA, 0, NotSerialized) // _STA: Status
- {
- Local0 = 0x0F
- If ((VGIA == Zero))
- {
- Local0 = Zero
- }
-
- Return (Local0)
- }
-
- Method (ADDR, 0, NotSerialized)
- {
- Local0 = Package (0x02) {}
- Index (Local0, Zero) = (VGIA + 0x28)
- Index (Local0, One) = Zero
- Return (Local0)
- }
- }
- }
-
- Method (\_GPE._E05, 0, NotSerialized) // _Exx: Edge-Triggered GPE
- {
- Notify (\_SB.VGEN, 0x80) // Status Change
- }
-}
-
-
-Design Details:
----------------
-
-Requirements R1a through R1e dictate that the memory holding the
-VM Generation ID must be allocated and owned by the guest firmware,
-in this case BIOS or UEFI. However, to be useful, QEMU must be able to
-change the contents of the memory at runtime, specifically when starting a
-backed-up or snapshotted image. In order to do this, QEMU must know the
-address that has been allocated.
-
-The mechanism chosen for this memory sharing is writeable fw_cfg blobs.
-These are data object that are visible to both QEMU and guests, and are
-addressable as sequential files.
-
-More information about fw_cfg can be found in "docs/specs/fw_cfg.txt"
-
-Two fw_cfg blobs are used in this case:
-
-/etc/vmgenid_guid - contains the actual VM Generation ID GUID
- - read-only to the guest
-/etc/vmgenid_addr - contains the address of the downloaded vmgenid blob
- - writeable by the guest
-
-
-QEMU sends the following commands to the guest at startup:
-
-1. Allocate memory for vmgenid_guid fw_cfg blob.
-2. Write the address of vmgenid_guid into the SSDT (VGIA ACPI variable as
- shown above in the iasl dump). Note that this change is not propagated
- back to QEMU.
-3. Write the address of vmgenid_guid back to QEMU's copy of vmgenid_addr
- via the fw_cfg DMA interface.
-
-After step 3, QEMU is able to update the contents of vmgenid_guid at will.
-
-Since BIOS or UEFI does not necessarily run when we wish to change the GUID,
-the value of VGIA is persisted via the VMState mechanism.
-
-As spelled out in the specification, any change to the GUID executes an
-ACPI notification. The exact handler to use is not specified, so the vmgenid
-device uses the first unused one: \_GPE._E05.
-
-
-Endian-ness Considerations:
----------------------------
-
-Although not specified in Microsoft's document, it is assumed that the
-device is expected to use little-endian format.
-
-All GUID passed in via command line or monitor are treated as big-endian.
-GUID values displayed via monitor are shown in big-endian format.
-
-
-GUID Storage Format:
---------------------
-
-In order to implement an OVMF "SDT Header Probe Suppressor", the contents of
-the vmgenid_guid fw_cfg blob are not simply a 128-bit GUID. There is also
-significant padding in order to align and fill a memory page, as shown in the
-following diagram:
-
-+----------------------------------+
-| SSDT with OEM Table ID = VMGENID |
-+----------------------------------+
-| ... | TOP OF PAGE
-| VGIA dword object ---------------|-----> +---------------------------+
-| ... | | fw-allocated array for |
-| _STA method referring to VGIA | | "etc/vmgenid_guid" |
-| ... | +---------------------------+
-| ADDR method referring to VGIA | | 0: OVMF SDT Header probe |
-| ... | | suppressor |
-+----------------------------------+ | 36: padding for 8-byte |
- | alignment |
- | 40: GUID |
- | 56: padding to page size |
- +---------------------------+
- END OF PAGE
-
-
-Device Usage:
--------------
-
-The device has one property, which may be only be set using the command line:
-
- guid - sets the value of the GUID. A special value "auto" instructs
- QEMU to generate a new random GUID.
-
-For example:
-
- QEMU -device vmgenid,guid="324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"
- QEMU -device vmgenid,guid=auto
-
-The property may be queried via QMP/HMP:
-
- (QEMU) query-vm-generation-id
- {"return": {"guid": "324e6eaf-d1d1-4bf6-bf41-b9bb6c91fb87"}}
-
-Setting of this parameter is intentionally left out from the QMP/HMP
-interfaces. There are no known use cases for changing the GUID once QEMU is
-running, and adding this capability would greatly increase the complexity.
diff --git a/docs/specs/vmw_pvscsi-spec.rst b/docs/specs/vmw_pvscsi-spec.rst
new file mode 100644
index 0000000000..b6f434a418
--- /dev/null
+++ b/docs/specs/vmw_pvscsi-spec.rst
@@ -0,0 +1,115 @@
+==============================
+VMWare PVSCSI Device Interface
+==============================
+
+..
+ Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD.
+
+This document describes the VMWare PVSCSI device interface specification,
+based on the source code of the PVSCSI Linux driver from kernel 3.0.4.
+
+Overview
+========
+
+The interface is based on a memory area shared between hypervisor and VM.
+The memory area is obtained by the driver as a device IO memory resource of
+``PVSCSI_MEM_SPACE_SIZE`` length.
+The shared memory consists of a registers area and a rings area.
+The registers area is used to raise hypervisor interrupts and issue device
+commands. The rings area is used to transfer data descriptors and SCSI
+commands from VM to hypervisor and to transfer messages produced by
+hypervisor to VM. Data itself is transferred via virtual scatter-gather DMA.
+
+PVSCSI Device Registers
+=======================
+
+The length of the registers area is 1 page
+(``PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES``). The structure of the
+registers area is described by the ``PVSCSIRegOffset`` enum. There
+are registers to issue device commands (with optional short data),
+issue device interrupts, and control interrupt masking.
+
+PVSCSI Device Rings
+===================
+
+There are three rings in shared memory:
+
+Request ring (``struct PVSCSIRingReqDesc *req_ring``)
+ ring for OS to device requests
+
+Completion ring (``struct PVSCSIRingCmpDesc *cmp_ring``)
+ ring for device request completions
+
+Message ring (``struct PVSCSIRingMsgDesc *msg_ring``)
+ ring for messages from device. This ring is optional and the
+ guest might not configure it.
+
+There is a control area (``struct PVSCSIRingsState *rings_state``)
+used to control rings operation.
+
+PVSCSI Device to Host Interrupts
+================================
+
+The following interrupt types are supported by the PVSCSI device:
+
+Completion interrupts (completion ring notifications):
+
+- ``PVSCSI_INTR_CMPL_0``
+- ``PVSCSI_INTR_CMPL_1``
+
+Message interrupts (message ring notifications):
+
+- ``PVSCSI_INTR_MSG_0``
+- ``PVSCSI_INTR_MSG_1``
+
+Interrupts are controlled via the ``PVSCSI_REG_OFFSET_INTR_MASK``
+register. If a bit is set it means the interrupt is enabled, and if
+it is clear then the interrupt is disabled.
+
+The interrupt modes supported are legacy, MSI and MSI-X.
+In the case of legacy interrupts, the ``PVSCSI_REG_OFFSET_INTR_STATUS``
+register is used to check which interrupt has arrived. Interrupts are
+acknowledged when the corresponding bit is written to the interrupt
+status register.
+
+PVSCSI Device Operation Sequences
+=================================
+
+Startup sequence
+----------------
+
+a. Issue ``PVSCSI_CMD_ADAPTER_RESET`` command
+b. Windows driver reads interrupt status register here
+c. Issue ``PVSCSI_CMD_SETUP_MSG_RING`` command with no additional data,
+ check status and disable device messages if error returned
+ (Omitted if device messages disabled by driver configuration)
+d. Issue ``PVSCSI_CMD_SETUP_RINGS`` command, provide rings configuration
+ as ``struct PVSCSICmdDescSetupRings``
+e. Issue ``PVSCSI_CMD_SETUP_MSG_RING`` command again, provide
+ rings configuration as ``struct PVSCSICmdDescSetupMsgRing``
+f. Unmask completion and message (if device messages enabled) interrupts
+
+Shutdown sequence
+-----------------
+
+a. Mask interrupts
+b. Flush request ring using ``PVSCSI_REG_OFFSET_KICK_NON_RW_IO``
+c. Issue ``PVSCSI_CMD_ADAPTER_RESET`` command
+
+Send request
+------------
+
+a. Fill next free request ring descriptor
+b. Issue ``PVSCSI_REG_OFFSET_KICK_RW_IO`` for R/W operations
+ or ``PVSCSI_REG_OFFSET_KICK_NON_RW_IO`` for other operations
+
+Abort command
+-------------
+
+a. Issue ``PVSCSI_CMD_ABORT_CMD`` command
+
+Request completion processing
+-----------------------------
+
+a. Upon arrival of a completion interrupt, process the completion
+ and message (if enabled) rings
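
The interrupt mask and status semantics described in vmw_pvscsi-spec.rst above can be illustrated with a small Python model. It is a sketch under stated assumptions, not the device implementation. The bit positions given to the four interrupt constants are illustrative assumptions (the text names the constants but not their values), the class `IntrModel` is hypothetical, and the model assumes that masked interrupts are still latched in the status register.

```python
# Bit positions below are illustrative assumptions, not taken from the text.
PVSCSI_INTR_CMPL_0 = 1 << 0
PVSCSI_INTR_CMPL_1 = 1 << 1
PVSCSI_INTR_MSG_0 = 1 << 2
PVSCSI_INTR_MSG_1 = 1 << 3


class IntrModel:
    """Hypothetical model of the INTR_MASK and INTR_STATUS registers."""

    def __init__(self):
        self.mask = 0    # a set bit means the interrupt is enabled
        self.status = 0  # pending interrupts

    def write_mask(self, value):
        self.mask = value

    def raise_intr(self, bits):
        # Device side: latch the interrupt as pending.
        self.status |= bits

    def write_status(self, bits):
        # Guest side: writing a bit to the status register acknowledges it.
        self.status &= ~bits

    def line_asserted(self):
        # The legacy interrupt line is active if any enabled bit is pending.
        return bool(self.status & self.mask)


intr = IntrModel()
intr.write_mask(PVSCSI_INTR_CMPL_0 | PVSCSI_INTR_CMPL_1)  # unmask completions
intr.raise_intr(PVSCSI_INTR_MSG_0)    # message interrupt is still masked
masked_result = intr.line_asserted()
intr.raise_intr(PVSCSI_INTR_CMPL_0)   # completion interrupt is enabled
enabled_result = intr.line_asserted()
intr.write_status(PVSCSI_INTR_CMPL_0)  # acknowledge the completion
print(masked_result, enabled_result, intr.line_asserted())
```

The final step corresponds to the legacy interrupt case, where the guest reads the status register to see which interrupt arrived and writes the same bit back to acknowledge it.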
diff --git a/docs/specs/vmw_pvscsi-spec.txt b/docs/specs/vmw_pvscsi-spec.txt
deleted file mode 100644
index 49affb2a42..0000000000
--- a/docs/specs/vmw_pvscsi-spec.txt
+++ /dev/null
@@ -1,92 +0,0 @@
-General Description
-===================
-
-This document describes VMWare PVSCSI device interface specification.
-Created by Dmitry Fleytman (dmitry@daynix.com), Daynix Computing LTD.
-Based on source code of PVSCSI Linux driver from kernel 3.0.4
-
-PVSCSI Device Interface Overview
-================================
-
-The interface is based on memory area shared between hypervisor and VM.
-Memory area is obtained by driver as device IO memory resource of
-PVSCSI_MEM_SPACE_SIZE length.
-The shared memory consists of registers area and rings area.
-The registers area is used to raise hypervisor interrupts and issue device
-commands. The rings area is used to transfer data descriptors and SCSI
-commands from VM to hypervisor and to transfer messages produced by
-hypervisor to VM. Data itself is transferred via virtual scatter-gather DMA.
-
-PVSCSI Device Registers
-=======================
-
-The length of the registers area is 1 page (PVSCSI_MEM_SPACE_COMMAND_NUM_PAGES).
-The structure of the registers area is described by the PVSCSIRegOffset enum.
-There are registers to issue device command (with optional short data),
-issue device interrupt, control interrupts masking.
-
-PVSCSI Device Rings
-===================
-
-There are three rings in shared memory:
-
- 1. Request ring (struct PVSCSIRingReqDesc *req_ring)
- - ring for OS to device requests
- 2. Completion ring (struct PVSCSIRingCmpDesc *cmp_ring)
- - ring for device request completions
- 3. Message ring (struct PVSCSIRingMsgDesc *msg_ring)
- - ring for messages from device.
- This ring is optional and the guest might not configure it.
-There is a control area (struct PVSCSIRingsState *rings_state) used to control
-rings operation.
-
-PVSCSI Device to Host Interrupts
-================================
-There are following interrupt types supported by PVSCSI device:
- 1. Completion interrupts (completion ring notifications):
- PVSCSI_INTR_CMPL_0
- PVSCSI_INTR_CMPL_1
- 2. Message interrupts (message ring notifications):
- PVSCSI_INTR_MSG_0
- PVSCSI_INTR_MSG_1
-
-Interrupts are controlled via PVSCSI_REG_OFFSET_INTR_MASK register
-Bit set means interrupt enabled, bit cleared - disabled
-
-Interrupt modes supported are legacy, MSI and MSI-X
-In case of legacy interrupts, register PVSCSI_REG_OFFSET_INTR_STATUS
-is used to check which interrupt has arrived. Interrupts are
-acknowledged when the corresponding bit is written to the interrupt
-status register.
-
-PVSCSI Device Operation Sequences
-=================================
-
-1. Startup sequence:
- a. Issue PVSCSI_CMD_ADAPTER_RESET command;
- aa. Windows driver reads interrupt status register here;
- b. Issue PVSCSI_CMD_SETUP_MSG_RING command with no additional data,
- check status and disable device messages if error returned;
- (Omitted if device messages disabled by driver configuration)
- c. Issue PVSCSI_CMD_SETUP_RINGS command, provide rings configuration
- as struct PVSCSICmdDescSetupRings;
- d. Issue PVSCSI_CMD_SETUP_MSG_RING command again, provide
- rings configuration as struct PVSCSICmdDescSetupMsgRing;
- e. Unmask completion and message (if device messages enabled) interrupts.
-
-2. Shutdown sequences
- a. Mask interrupts;
- b. Flush request ring using PVSCSI_REG_OFFSET_KICK_NON_RW_IO;
- c. Issue PVSCSI_CMD_ADAPTER_RESET command.
-
-3. Send request
- a. Fill next free request ring descriptor;
- b. Issue PVSCSI_REG_OFFSET_KICK_RW_IO for R/W operations;
- or PVSCSI_REG_OFFSET_KICK_NON_RW_IO for other operations.
-
-4. Abort command
- a. Issue PVSCSI_CMD_ABORT_CMD command;
-
-5. Request completion processing
- a. Upon completion interrupt arrival process completion
- and message (if enabled) rings.
diff --git a/docs/sphinx-static/custom.js b/docs/sphinx-static/custom.js
new file mode 100644
index 0000000000..71a8605305
--- /dev/null
+++ b/docs/sphinx-static/custom.js
@@ -0,0 +1,9 @@
+document.addEventListener('keydown', (event) => {
+    // find a better way to look it up?
+    let search_input = document.getElementsByName('q')[0];
+
+    if (event.code === 'KeyS' && document.activeElement !== search_input) {
+        event.preventDefault();
+        search_input.focus();
+    }
+});
diff --git a/docs/sphinx/dbusdoc.py b/docs/sphinx/dbusdoc.py
new file mode 100644
index 0000000000..be284ed08f
--- /dev/null
+++ b/docs/sphinx/dbusdoc.py
@@ -0,0 +1,166 @@
+# D-Bus XML documentation extension
+#
+# Copyright (C) 2021, Red Hat Inc.
+#
+# SPDX-License-Identifier: LGPL-2.1-or-later
+#
+# Author: Marc-André Lureau <marcandre.lureau@redhat.com>
+"""dbus-doc is a Sphinx extension that provides documentation from D-Bus XML."""
+
+import os
+import re
+from typing import (
+ TYPE_CHECKING,
+ Any,
+ Callable,
+ Dict,
+ Iterator,
+ List,
+ Optional,
+ Sequence,
+ Set,
+ Tuple,
+ Type,
+ TypeVar,
+ Union,
+)
+
+import sphinx
+from docutils import nodes
+from docutils.nodes import Element, Node
+from docutils.parsers.rst import Directive, directives
+from docutils.parsers.rst.states import RSTState
+from docutils.statemachine import StringList, ViewList
+from sphinx.application import Sphinx
+from sphinx.errors import ExtensionError
+from sphinx.util import logging
+from sphinx.util.docstrings import prepare_docstring
+from sphinx.util.docutils import SphinxDirective, switch_source_input
+from sphinx.util.nodes import nested_parse_with_titles
+
+import dbusdomain
+from dbusparser import parse_dbus_xml
+
+logger = logging.getLogger(__name__)
+
+__version__ = "1.0"
+
+
+class DBusDoc:
+ def __init__(self, sphinx_directive, dbusfile):
+ self._cur_doc = None
+ self._sphinx_directive = sphinx_directive
+ self._dbusfile = dbusfile
+ self._top_node = nodes.section()
+ self.result = StringList()
+ self.indent = ""
+
+ def add_line(self, line: str, *lineno: int) -> None:
+ """Append one line of generated reST to the output."""
+ if line.strip(): # not a blank line
+ self.result.append(self.indent + line, self._dbusfile, *lineno)
+ else:
+ self.result.append("", self._dbusfile, *lineno)
+
+ def add_method(self, method):
+ self.add_line(f".. dbus:method:: {method.name}")
+ self.add_line("")
+ self.indent += " "
+ for arg in method.in_args:
+ self.add_line(f":arg {arg.signature} {arg.name}: {arg.doc_string}")
+ for arg in method.out_args:
+ self.add_line(f":ret {arg.signature} {arg.name}: {arg.doc_string}")
+ self.add_line("")
+ for line in prepare_docstring("\n" + method.doc_string):
+ self.add_line(line)
+ self.indent = self.indent[:-3]
+
+ def add_signal(self, signal):
+ self.add_line(f".. dbus:signal:: {signal.name}")
+ self.add_line("")
+ self.indent += " "
+ for arg in signal.args:
+ self.add_line(f":arg {arg.signature} {arg.name}: {arg.doc_string}")
+ self.add_line("")
+ for line in prepare_docstring("\n" + signal.doc_string):
+ self.add_line(line)
+ self.indent = self.indent[:-3]
+
+ def add_property(self, prop):
+ self.add_line(f".. dbus:property:: {prop.name}")
+ self.indent += " "
+ self.add_line(f":type: {prop.signature}")
+ access = {"read": "readonly", "write": "writeonly", "readwrite": "readwrite"}[
+ prop.access
+ ]
+ self.add_line(f":{access}:")
+ if prop.emits_changed_signal:
+ self.add_line(":emits-changed: yes")
+ self.add_line("")
+ for line in prepare_docstring("\n" + prop.doc_string):
+ self.add_line(line)
+ self.indent = self.indent[:-3]
+
+ def add_interface(self, iface):
+ self.add_line(f".. dbus:interface:: {iface.name}")
+ self.add_line("")
+ self.indent += " "
+ for line in prepare_docstring("\n" + iface.doc_string):
+ self.add_line(line)
+ for method in iface.methods:
+ self.add_method(method)
+ for sig in iface.signals:
+ self.add_signal(sig)
+ for prop in iface.properties:
+ self.add_property(prop)
+ self.indent = self.indent[:-3]
+
+
+def parse_generated_content(state: RSTState, content: StringList) -> List[Node]:
+ """Parse the generated reST content into a list of docutils nodes."""
+ with switch_source_input(state, content):
+ node = nodes.paragraph()
+ node.document = state.document
+ state.nested_parse(content, 0, node)
+
+ return node.children
+
+
+class DBusDocDirective(SphinxDirective):
+ """Extract documentation from the specified D-Bus XML file"""
+
+ has_content = True
+ required_arguments = 1
+ optional_arguments = 0
+ final_argument_whitespace = True
+
+ def run(self):
+ reporter = self.state.document.reporter
+
+ try:
+ source, lineno = reporter.get_source_and_line(self.lineno) # type: ignore
+ except AttributeError:
+ source, lineno = (None, None)
+
+ logger.debug("[dbusdoc] %s:%s: input:\n%s", source, lineno, self.block_text)
+
+ env = self.state.document.settings.env
+ dbusfile = env.config.qapidoc_srctree + "/" + self.arguments[0]
+ with open(dbusfile, "rb") as f:
+ xml_data = f.read()
+ xml = parse_dbus_xml(xml_data)
+ doc = DBusDoc(self, dbusfile)
+ for iface in xml:
+ doc.add_interface(iface)
+
+ result = parse_generated_content(self.state, doc.result)
+ return result
+
+
+def setup(app: Sphinx) -> Dict[str, Any]:
+ """Register dbus-doc directive with Sphinx"""
+ app.add_config_value("dbusdoc_srctree", None, "env")
+ app.add_directive("dbus-doc", DBusDocDirective)
+ dbusdomain.setup(app)
+
+ return dict(version=__version__, parallel_read_safe=True, parallel_write_safe=True)
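Note on the indentation handling in ``DBusDoc``: each ``add_*`` method emits a directive line, increases ``self.indent``, emits the body, and then trims the indent again with ``self.indent[:-3]``. The following is a minimal standalone sketch of that pattern, not the extension itself. ``RstWriter`` and its ``INDENT`` constant are hypothetical names, the three-space indent is an assumption inferred from the ``[:-3]`` slice, and plain lists stand in for ``StringList``.

```python
class RstWriter:
    """Collects generated reST lines, tracking the current indent."""

    INDENT = "   "  # assumed width, matching the [:-3] trim in the diff

    def __init__(self):
        self.lines = []
        self.indent = ""

    def add_line(self, line):
        # Blank lines are stored without indentation, as in DBusDoc.add_line.
        self.lines.append(self.indent + line if line.strip() else "")

    def add_method(self, name, in_args):
        # Directive header at the current level, body one level deeper.
        self.add_line(f".. dbus:method:: {name}")
        self.add_line("")
        self.indent += self.INDENT
        for signature, arg_name, doc in in_args:
            self.add_line(f":arg {signature} {arg_name}: {doc}")
        self.add_line("")
        # Restore the previous level by dropping exactly one indent step.
        self.indent = self.indent[: -len(self.INDENT)]


writer = RstWriter()
writer.add_method("Ping", [("s", "text", "Text to echo")])
RESULT = writer.lines
```

Deriving the trim length from the same constant that is appended keeps the two in sync if the indent width ever changes.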
diff --git a/docs/sphinx/dbusdomain.py b/docs/sphinx/dbusdomain.py
new file mode 100644
index 0000000000..9872fd5bf6
--- /dev/null
+++ b/docs/sphinx/dbusdomain.py
@@ -0,0 +1,410 @@
+# D-Bus sphinx domain extension
+#
+# Copyright (C) 2021, Red Hat Inc.
+#
+# SPDX-License-Identifier: LGPL-2.1-or-later
+#
+# Author: Marc-André Lureau <marcandre.lureau@redhat.com>
+
+from typing import (
+ Any,
+ Dict,
+ Iterable,
+ Iterator,
+ List,
+ NamedTuple,
+ Optional,
+ Tuple,
+ cast,
+)
+
+from docutils import nodes
+from docutils.nodes import Element, Node
+from docutils.parsers.rst import directives
+from sphinx import addnodes
+from sphinx.addnodes import desc_signature, pending_xref
+from sphinx.directives import ObjectDescription
+from sphinx.domains import Domain, Index, IndexEntry, ObjType
+from sphinx.locale import _
+from sphinx.roles import XRefRole
+from sphinx.util import nodes as node_utils
+from sphinx.util.docfields import Field, TypedField
+from sphinx.util.typing import OptionSpec
+
+
+class DBusDescription(ObjectDescription[str]):
+ """Base class for DBus objects"""
+
+ option_spec: OptionSpec = ObjectDescription.option_spec.copy()
+ option_spec.update(
+ {
+ "deprecated": directives.flag,
+ }
+ )
+
+ def get_index_text(self, modname: str, name: str) -> str:
+ """Return the text for the index entry of the object."""
+ raise NotImplementedError("must be implemented in subclasses")
+
+ def add_target_and_index(
+ self, name: str, sig: str, signode: desc_signature
+ ) -> None:
+ ifacename = self.env.ref_context.get("dbus:interface")
+ node_id = name
+ if ifacename:
+ node_id = f"{ifacename}.{node_id}"
+
+ signode["names"].append(name)
+ signode["ids"].append(node_id)
+
+ if "noindexentry" not in self.options:
+ indextext = self.get_index_text(ifacename, name)
+ if indextext:
+ self.indexnode["entries"].append(
+ ("single", indextext, node_id, "", None)
+ )
+
+ domain = cast(DBusDomain, self.env.get_domain("dbus"))
+ domain.note_object(name, self.objtype, node_id, location=signode)
+
+
+class DBusInterface(DBusDescription):
+ """
+ Implementation of ``dbus:interface``.
+ """
+
+ def get_index_text(self, ifacename: str, name: str) -> str:
+ return ifacename
+
+ def before_content(self) -> None:
+ self.env.ref_context["dbus:interface"] = self.arguments[0]
+
+ def after_content(self) -> None:
+ self.env.ref_context.pop("dbus:interface")
+
+ def handle_signature(self, sig: str, signode: desc_signature) -> str:
+ signode += addnodes.desc_annotation("interface ", "interface ")
+ signode += addnodes.desc_name(sig, sig)
+ return sig
+
+ def run(self) -> List[Node]:
+ _, node = super().run()
+ name = self.arguments[0]
+ section = nodes.section(ids=[name + "-section"])
+ section += nodes.title(name, "%s interface" % name)
+ section += node
+ return [self.indexnode, section]
+
+
+class DBusMember(DBusDescription):
+
+ signal = False
+
+
+class DBusMethod(DBusMember):
+ """
+ Implementation of ``dbus:method``.
+ """
+
+ option_spec: OptionSpec = DBusMember.option_spec.copy()
+ option_spec.update(
+ {
+ "noreply": directives.flag,
+ }
+ )
+
+ doc_field_types: List[Field] = [
+ TypedField(
+ "arg",
+ label=_("Arguments"),
+ names=("arg",),
+ rolename="arg",
+ typerolename=None,
+ typenames=("argtype", "type"),
+ ),
+ TypedField(
+ "ret",
+ label=_("Returns"),
+ names=("ret",),
+ rolename="ret",
+ typerolename=None,
+ typenames=("rettype", "type"),
+ ),
+ ]
+
+ def get_index_text(self, ifacename: str, name: str) -> str:
+ return _("%s() (%s method)") % (name, ifacename)
+
+ def handle_signature(self, sig: str, signode: desc_signature) -> str:
+ params = addnodes.desc_parameterlist()
+ returns = addnodes.desc_parameterlist()
+
+ contentnode = addnodes.desc_content()
+ self.state.nested_parse(self.content, self.content_offset, contentnode)
+ for child in contentnode:
+ if isinstance(child, nodes.field_list):
+ for field in child:
+ ty, sg, name = field[0].astext().split(None, 2)
+ param = addnodes.desc_parameter()
+ param += addnodes.desc_sig_keyword_type(sg, sg)
+ param += addnodes.desc_sig_space()
+ param += addnodes.desc_sig_name(name, name)
+ if ty == "arg":
+ params += param
+ elif ty == "ret":
+ returns += param
+
+ anno = "signal " if self.signal else "method "
+ signode += addnodes.desc_annotation(anno, anno)
+ signode += addnodes.desc_name(sig, sig)
+ signode += params
+ if not self.signal and "noreply" not in self.options:
+ ret = addnodes.desc_returns()
+ ret += returns
+ signode += ret
+
+ return sig
+
+
+class DBusSignal(DBusMethod):
+ """
+ Implementation of ``dbus:signal``.
+ """
+
+ doc_field_types: List[Field] = [
+ TypedField(
+ "arg",
+ label=_("Arguments"),
+ names=("arg",),
+ rolename="arg",
+ typerolename=None,
+ typenames=("argtype", "type"),
+ ),
+ ]
+ signal = True
+
+ def get_index_text(self, ifacename: str, name: str) -> str:
+ return _("%s() (%s signal)") % (name, ifacename)
+
+
+class DBusProperty(DBusMember):
+ """
+ Implementation of ``dbus:property``.
+ """
+
+ option_spec: OptionSpec = DBusMember.option_spec.copy()
+ option_spec.update(
+ {
+ "type": directives.unchanged,
+ "readonly": directives.flag,
+ "writeonly": directives.flag,
+ "readwrite": directives.flag,
+ "emits-changed": directives.unchanged,
+ }
+ )
+
+ doc_field_types: List[Field] = []
+
+ def get_index_text(self, ifacename: str, name: str) -> str:
+ return _("%s (%s property)") % (name, ifacename)
+
+ def transform_content(self, contentnode: addnodes.desc_content) -> None:
+ fieldlist = nodes.field_list()
+ access = None
+ if "readonly" in self.options:
+ access = _("read-only")
+ if "writeonly" in self.options:
+ access = _("write-only")
+ if "readwrite" in self.options:
+ access = _("read & write")
+ if access:
+ content = nodes.Text(access)
+ fieldname = nodes.field_name("", _("Access"))
+ fieldbody = nodes.field_body("", nodes.paragraph("", "", content))
+ field = nodes.field("", fieldname, fieldbody)
+ fieldlist += field
+ emits = self.options.get("emits-changed", None)
+ if emits:
+ content = nodes.Text(emits)
+ fieldname = nodes.field_name("", _("Emits Changed"))
+ fieldbody = nodes.field_body("", nodes.paragraph("", "", content))
+ field = nodes.field("", fieldname, fieldbody)
+ fieldlist += field
+ if len(fieldlist) > 0:
+ contentnode.insert(0, fieldlist)
+
+ def handle_signature(self, sig: str, signode: desc_signature) -> str:
+ contentnode = addnodes.desc_content()
+ self.state.nested_parse(self.content, self.content_offset, contentnode)
+ ty = self.options.get("type")
+
+ signode += addnodes.desc_annotation("property ", "property ")
+ signode += addnodes.desc_name(sig, sig)
+ signode += addnodes.desc_sig_punctuation("", ":")
+ signode += addnodes.desc_sig_keyword_type(ty, ty)
+ return sig
+
+ def run(self) -> List[Node]:
+ self.name = "dbus:member"
+ return super().run()
+
+
+class DBusXRef(XRefRole):
+ def process_link(self, env, refnode, has_explicit_title, title, target):
+ refnode["dbus:interface"] = env.ref_context.get("dbus:interface")
+ if not has_explicit_title:
+ title = title.lstrip(".") # only has a meaning for the target
+ target = target.lstrip("~") # only has a meaning for the title
+ # if the first character is a tilde, don't display the module/class
+ # parts of the contents
+ if title[0:1] == "~":
+ title = title[1:]
+ dot = title.rfind(".")
+ if dot != -1:
+ title = title[dot + 1 :]
+ # if the first character is a dot, search more specific namespaces first
+ # else search builtins first
+ if target[0:1] == ".":
+ target = target[1:]
+ refnode["refspecific"] = True
+ return title, target
+
+
+class DBusIndex(Index):
+ """
+ Index subclass to provide a D-Bus interfaces index.
+ """
+
+ name = "dbusindex"
+ localname = _("D-Bus Interfaces Index")
+ shortname = _("dbus")
+
+ def generate(
+ self, docnames: Optional[Iterable[str]] = None
+ ) -> Tuple[List[Tuple[str, List[IndexEntry]]], bool]:
+ content: Dict[str, List[IndexEntry]] = {}
+ # list of prefixes to ignore
+ ignores: List[str] = self.domain.env.config["dbus_index_common_prefix"]
+ ignores = sorted(ignores, key=len, reverse=True)
+
+ ifaces = sorted(
+ [
+ x
+ for x in self.domain.data["objects"].items()
+ if x[1].objtype == "interface"
+ ],
+ key=lambda x: x[0].lower(),
+ )
+ for name, (docname, node_id, _) in ifaces:
+ if docnames and docname not in docnames:
+ continue
+
+ for ignore in ignores:
+ if name.startswith(ignore):
+ name = name[len(ignore) :]
+ stripped = ignore
+ break
+ else:
+ stripped = ""
+
+ entries = content.setdefault(name[0].lower(), [])
+ entries.append(IndexEntry(stripped + name, 0, docname, node_id, "", "", ""))
+
+ # sort by first letter
+ sorted_content = sorted(content.items())
+
+ return sorted_content, False
+
+
+class ObjectEntry(NamedTuple):
+ docname: str
+ node_id: str
+ objtype: str
+
+
+class DBusDomain(Domain):
+ """
+ Implementation of the D-Bus domain.
+ """
+
+ name = "dbus"
+ label = "D-Bus"
+ object_types: Dict[str, ObjType] = {
+ "interface": ObjType(_("interface"), "iface", "obj"),
+ "method": ObjType(_("method"), "meth", "obj"),
+ "signal": ObjType(_("signal"), "sig", "obj"),
+ "property": ObjType(_("property"), "attr", "_prop", "obj"),
+ }
+ directives = {
+ "interface": DBusInterface,
+ "method": DBusMethod,
+ "signal": DBusSignal,
+ "property": DBusProperty,
+ }
+ roles = {
+ "iface": DBusXRef(),
+ "meth": DBusXRef(),
+ "sig": DBusXRef(),
+ "prop": DBusXRef(),
+ }
+ initial_data: Dict[str, Dict[str, Tuple[Any]]] = {
+ "objects": {}, # fullname -> ObjectEntry
+ }
+ indices = [
+ DBusIndex,
+ ]
+
+ @property
+ def objects(self) -> Dict[str, ObjectEntry]:
+ return self.data.setdefault("objects", {}) # fullname -> ObjectEntry
+
+ def note_object(
+ self, name: str, objtype: str, node_id: str, location: Any = None
+ ) -> None:
+ self.objects[name] = ObjectEntry(self.env.docname, node_id, objtype)
+
+ def clear_doc(self, docname: str) -> None:
+ for fullname, obj in list(self.objects.items()):
+ if obj.docname == docname:
+ del self.objects[fullname]
+
+ def find_obj(self, typ: str, name: str) -> Optional[ObjectEntry]:
+ # skip parens
+ if name[-2:] == "()":
+ name = name[:-2]
+ if typ in ("meth", "sig", "prop"):
+ try:
+ ifacename, name = name.rsplit(".", 1)
+ except ValueError:
+ pass
+ return self.objects.get(name)
+
+ def resolve_xref(
+ self,
+ env: "BuildEnvironment",
+ fromdocname: str,
+ builder: "Builder",
+ typ: str,
+ target: str,
+ node: pending_xref,
+ contnode: Element,
+ ) -> Optional[Element]:
+ """Resolve the pending_xref *node* with the given *typ* and *target*."""
+ objdef = self.find_obj(typ, target)
+ if objdef:
+ return node_utils.make_refnode(
+ builder, fromdocname, objdef.docname, objdef.node_id, contnode
+ )
+
+ def get_objects(self) -> Iterator[Tuple[str, str, str, str, str, int]]:
+ for refname, obj in self.objects.items():
+ yield (refname, refname, obj.objtype, obj.docname, obj.node_id, 1)
+
+ def merge_domaindata(self, docnames, otherdata):
+ for name, obj in otherdata['objects'].items():
+ if obj.docname in docnames:
+ self.data['objects'][name] = obj
+
+def setup(app):
+ app.add_domain(DBusDomain)
+ app.add_config_value("dbus_index_common_prefix", [], "env")
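Note on cross-reference lookup in ``DBusDomain.find_obj``: the target is normalised by dropping a trailing ``()`` and, for member roles, the interface prefix before the last dot. The sketch below reproduces that normalisation outside Sphinx so it can be tried in isolation. The ``OBJECTS`` table and the ``org.example.Demo`` names are hypothetical illustration data.

```python
from typing import Dict, NamedTuple, Optional


class ObjectEntry(NamedTuple):
    docname: str
    node_id: str
    objtype: str


def find_obj(objects: Dict[str, ObjectEntry], typ: str, name: str) -> Optional[ObjectEntry]:
    # Skip trailing parentheses, e.g. "Ping()" -> "Ping".
    if name.endswith("()"):
        name = name[:-2]
    # Member roles are looked up by bare member name; when there is no dot
    # the name is left unchanged, like the ValueError branch in the diff.
    if typ in ("meth", "sig", "prop"):
        name = name.rsplit(".", 1)[-1]
    return objects.get(name)


# Hypothetical registry contents, keyed the way note_object() stores them.
OBJECTS = {
    "org.example.Demo": ObjectEntry("demo", "org.example.Demo", "interface"),
    "Ping": ObjectEntry("demo", "org.example.Demo.Ping", "method"),
}

FOUND_METHOD = find_obj(OBJECTS, "meth", "org.example.Demo.Ping()")
FOUND_IFACE = find_obj(OBJECTS, "iface", "org.example.Demo")
MISSING = find_obj(OBJECTS, "sig", "org.example.Demo.Nope")
```

Because entries are keyed by the bare member name in this sketch, two interfaces that each define a member called ``Ping`` would share a single entry; whether that matters depends on the documented interfaces.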
diff --git a/docs/sphinx/dbusparser.py b/docs/sphinx/dbusparser.py
new file mode 100644
index 0000000000..024553eae7
--- /dev/null
+++ b/docs/sphinx/dbusparser.py
@@ -0,0 +1,373 @@
+# Based on "GDBus - GLib D-Bus Library":
+#
+# Copyright (C) 2008-2011 Red Hat, Inc.
+#
+# This library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# This library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General
+# Public License along with this library; if not, see <http://www.gnu.org/licenses/>.
+#
+# Author: David Zeuthen <davidz@redhat.com>
+
+import xml.parsers.expat
+
+
+class Annotation:
+ def __init__(self, key, value):
+ self.key = key
+ self.value = value
+ self.annotations = []
+ self.since = ""
+
+
+class Arg:
+ def __init__(self, name, signature):
+ self.name = name
+ self.signature = signature
+ self.annotations = []
+ self.doc_string = ""
+ self.since = ""
+
+
+class Method:
+ def __init__(self, name, h_type_implies_unix_fd=True):
+ self.name = name
+ self.h_type_implies_unix_fd = h_type_implies_unix_fd
+ self.in_args = []
+ self.out_args = []
+ self.annotations = []
+ self.doc_string = ""
+ self.since = ""
+ self.deprecated = False
+ self.unix_fd = False
+
+
+class Signal:
+ def __init__(self, name):
+ self.name = name
+ self.args = []
+ self.annotations = []
+ self.doc_string = ""
+ self.since = ""
+ self.deprecated = False
+
+
+class Property:
+ def __init__(self, name, signature, access):
+ self.name = name
+ self.signature = signature
+ self.access = access
+ self.annotations = []
+ self.arg = Arg("value", self.signature)
+ self.arg.annotations = self.annotations
+ self.readable = False
+ self.writable = False
+ if self.access == "readwrite":
+ self.readable = True
+ self.writable = True
+ elif self.access == "read":
+ self.readable = True
+ elif self.access == "write":
+ self.writable = True
+ else:
+ raise ValueError('Invalid access type "{}"'.format(self.access))
+ self.doc_string = ""
+ self.since = ""
+ self.deprecated = False
+ self.emits_changed_signal = True
+
+
+class Interface:
+ def __init__(self, name):
+ self.name = name
+ self.methods = []
+ self.signals = []
+ self.properties = []
+ self.annotations = []
+ self.doc_string = ""
+ self.doc_string_brief = ""
+ self.since = ""
+ self.deprecated = False
+
+
+class DBusXMLParser:
+ STATE_TOP = "top"
+ STATE_NODE = "node"
+ STATE_INTERFACE = "interface"
+ STATE_METHOD = "method"
+ STATE_SIGNAL = "signal"
+ STATE_PROPERTY = "property"
+ STATE_ARG = "arg"
+ STATE_ANNOTATION = "annotation"
+ STATE_IGNORED = "ignored"
+
+ def __init__(self, xml_data, h_type_implies_unix_fd=True):
+ self._parser = xml.parsers.expat.ParserCreate()
+ self._parser.CommentHandler = self.handle_comment
+ self._parser.CharacterDataHandler = self.handle_char_data
+ self._parser.StartElementHandler = self.handle_start_element
+ self._parser.EndElementHandler = self.handle_end_element
+
+ self.parsed_interfaces = []
+ self._cur_object = None
+
+ self.state = DBusXMLParser.STATE_TOP
+ self.state_stack = []
+ self._cur_object = None
+ self._cur_object_stack = []
+
+ self.doc_comment_last_symbol = ""
+
+ self._h_type_implies_unix_fd = h_type_implies_unix_fd
+
+ self._parser.Parse(xml_data)
+
+ COMMENT_STATE_BEGIN = "begin"
+ COMMENT_STATE_PARAMS = "params"
+ COMMENT_STATE_BODY = "body"
+ COMMENT_STATE_SKIP = "skip"
+
+ def handle_comment(self, data):
+ comment_state = DBusXMLParser.COMMENT_STATE_BEGIN
+ lines = data.split("\n")
+ symbol = ""
+ body = ""
+ in_para = False
+ params = {}
+ for line in lines:
+ orig_line = line
+ line = line.lstrip()
+ if comment_state == DBusXMLParser.COMMENT_STATE_BEGIN:
+ if len(line) > 0:
+ colon_index = line.find(": ")
+ if colon_index == -1:
+ if line.endswith(":"):
+ symbol = line[0 : len(line) - 1]
+ comment_state = DBusXMLParser.COMMENT_STATE_PARAMS
+ else:
+ comment_state = DBusXMLParser.COMMENT_STATE_SKIP
+ else:
+ symbol = line[0:colon_index]
+ rest_of_line = line[colon_index + 2 :].strip()
+ if len(rest_of_line) > 0:
+ body += rest_of_line + "\n"
+ comment_state = DBusXMLParser.COMMENT_STATE_PARAMS
+ elif comment_state == DBusXMLParser.COMMENT_STATE_PARAMS:
+ if line.startswith("@"):
+ colon_index = line.find(": ")
+ if colon_index == -1:
+ comment_state = DBusXMLParser.COMMENT_STATE_BODY
+ if not in_para:
+ in_para = True
+ body += orig_line + "\n"
+ else:
+ param = line[1:colon_index]
+ docs = line[colon_index + 2 :]
+ params[param] = docs
+ else:
+ comment_state = DBusXMLParser.COMMENT_STATE_BODY
+ if len(line) > 0:
+ if not in_para:
+ in_para = True
+ body += orig_line + "\n"
+ elif comment_state == DBusXMLParser.COMMENT_STATE_BODY:
+ if len(line) > 0:
+ if not in_para:
+ in_para = True
+ body += orig_line + "\n"
+ else:
+ if in_para:
+ body += "\n"
+ in_para = False
+ if in_para:
+ body += "\n"
+
+ if symbol != "":
+ self.doc_comment_last_symbol = symbol
+ self.doc_comment_params = params
+ self.doc_comment_body = body
+
+ def handle_char_data(self, data):
+ # print 'char_data=%s'%data
+ pass
+
+ def handle_start_element(self, name, attrs):
+ old_state = self.state
+ old_cur_object = self._cur_object
+ if self.state == DBusXMLParser.STATE_IGNORED:
+ self.state = DBusXMLParser.STATE_IGNORED
+ elif self.state == DBusXMLParser.STATE_TOP:
+ if name == DBusXMLParser.STATE_NODE:
+ self.state = DBusXMLParser.STATE_NODE
+ else:
+ self.state = DBusXMLParser.STATE_IGNORED
+ elif self.state == DBusXMLParser.STATE_NODE:
+ if name == DBusXMLParser.STATE_INTERFACE:
+ self.state = DBusXMLParser.STATE_INTERFACE
+ iface = Interface(attrs["name"])
+ self._cur_object = iface
+ self.parsed_interfaces.append(iface)
+ elif name == DBusXMLParser.STATE_ANNOTATION:
+ self.state = DBusXMLParser.STATE_ANNOTATION
+ anno = Annotation(attrs["name"], attrs["value"])
+ self._cur_object.annotations.append(anno)
+ self._cur_object = anno
+ else:
+ self.state = DBusXMLParser.STATE_IGNORED
+
+ # assign docs, if any
+ if "name" in attrs and self.doc_comment_last_symbol == attrs["name"]:
+ self._cur_object.doc_string = self.doc_comment_body
+ if "short_description" in self.doc_comment_params:
+ short_description = self.doc_comment_params["short_description"]
+ self._cur_object.doc_string_brief = short_description
+ if "since" in self.doc_comment_params:
+ self._cur_object.since = self.doc_comment_params["since"].strip()
+
+ elif self.state == DBusXMLParser.STATE_INTERFACE:
+ if name == DBusXMLParser.STATE_METHOD:
+ self.state = DBusXMLParser.STATE_METHOD
+ method = Method(
+ attrs["name"], h_type_implies_unix_fd=self._h_type_implies_unix_fd
+ )
+ self._cur_object.methods.append(method)
+ self._cur_object = method
+ elif name == DBusXMLParser.STATE_SIGNAL:
+ self.state = DBusXMLParser.STATE_SIGNAL
+ signal = Signal(attrs["name"])
+ self._cur_object.signals.append(signal)
+ self._cur_object = signal
+ elif name == DBusXMLParser.STATE_PROPERTY:
+ self.state = DBusXMLParser.STATE_PROPERTY
+ prop = Property(attrs["name"], attrs["type"], attrs["access"])
+ self._cur_object.properties.append(prop)
+ self._cur_object = prop
+ elif name == DBusXMLParser.STATE_ANNOTATION:
+ self.state = DBusXMLParser.STATE_ANNOTATION
+ anno = Annotation(attrs["name"], attrs["value"])
+ self._cur_object.annotations.append(anno)
+ self._cur_object = anno
+ else:
+ self.state = DBusXMLParser.STATE_IGNORED
+
+ # assign docs, if any
+ if "name" in attrs and self.doc_comment_last_symbol == attrs["name"]:
+ self._cur_object.doc_string = self.doc_comment_body
+ if "since" in self.doc_comment_params:
+ self._cur_object.since = self.doc_comment_params["since"].strip()
+
+ elif self.state == DBusXMLParser.STATE_METHOD:
+ if name == DBusXMLParser.STATE_ARG:
+ self.state = DBusXMLParser.STATE_ARG
+ arg_name = None
+ if "name" in attrs:
+ arg_name = attrs["name"]
+ arg = Arg(arg_name, attrs["type"])
+ direction = attrs.get("direction", "in")
+ if direction == "in":
+ self._cur_object.in_args.append(arg)
+ elif direction == "out":
+ self._cur_object.out_args.append(arg)
+ else:
+ raise ValueError('Invalid direction "{}"'.format(direction))
+ self._cur_object = arg
+ elif name == DBusXMLParser.STATE_ANNOTATION:
+ self.state = DBusXMLParser.STATE_ANNOTATION
+ anno = Annotation(attrs["name"], attrs["value"])
+ self._cur_object.annotations.append(anno)
+ self._cur_object = anno
+ else:
+ self.state = DBusXMLParser.STATE_IGNORED
+
+ # assign docs, if any
+ if self.doc_comment_last_symbol == old_cur_object.name:
+ if "name" in attrs and attrs["name"] in self.doc_comment_params:
+ doc_string = self.doc_comment_params[attrs["name"]]
+ if doc_string is not None:
+ self._cur_object.doc_string = doc_string
+ if "since" in self.doc_comment_params:
+ self._cur_object.since = self.doc_comment_params[
+ "since"
+ ].strip()
+
+ elif self.state == DBusXMLParser.STATE_SIGNAL:
+ if name == DBusXMLParser.STATE_ARG:
+ self.state = DBusXMLParser.STATE_ARG
+ arg_name = None
+ if "name" in attrs:
+ arg_name = attrs["name"]
+ arg = Arg(arg_name, attrs["type"])
+ self._cur_object.args.append(arg)
+ self._cur_object = arg
+ elif name == DBusXMLParser.STATE_ANNOTATION:
+ self.state = DBusXMLParser.STATE_ANNOTATION
+ anno = Annotation(attrs["name"], attrs["value"])
+ self._cur_object.annotations.append(anno)
+ self._cur_object = anno
+ else:
+ self.state = DBusXMLParser.STATE_IGNORED
+
+ # assign docs, if any
+ if self.doc_comment_last_symbol == old_cur_object.name:
+ if "name" in attrs and attrs["name"] in self.doc_comment_params:
+ doc_string = self.doc_comment_params[attrs["name"]]
+ if doc_string is not None:
+ self._cur_object.doc_string = doc_string
+ if "since" in self.doc_comment_params:
+ self._cur_object.since = self.doc_comment_params[
+ "since"
+ ].strip()
+
+ elif self.state == DBusXMLParser.STATE_PROPERTY:
+ if name == DBusXMLParser.STATE_ANNOTATION:
+ self.state = DBusXMLParser.STATE_ANNOTATION
+ anno = Annotation(attrs["name"], attrs["value"])
+ self._cur_object.annotations.append(anno)
+ self._cur_object = anno
+ else:
+ self.state = DBusXMLParser.STATE_IGNORED
+
+ elif self.state == DBusXMLParser.STATE_ARG:
+ if name == DBusXMLParser.STATE_ANNOTATION:
+ self.state = DBusXMLParser.STATE_ANNOTATION
+ anno = Annotation(attrs["name"], attrs["value"])
+ self._cur_object.annotations.append(anno)
+ self._cur_object = anno
+ else:
+ self.state = DBusXMLParser.STATE_IGNORED
+
+ elif self.state == DBusXMLParser.STATE_ANNOTATION:
+ if name == DBusXMLParser.STATE_ANNOTATION:
+ self.state = DBusXMLParser.STATE_ANNOTATION
+ anno = Annotation(attrs["name"], attrs["value"])
+ self._cur_object.annotations.append(anno)
+ self._cur_object = anno
+ else:
+ self.state = DBusXMLParser.STATE_IGNORED
+
+ else:
+ raise ValueError(
+ 'Unhandled state "{}" while entering element with name "{}"'.format(
+ self.state, name
+ )
+ )
+
+ self.state_stack.append(old_state)
+ self._cur_object_stack.append(old_cur_object)
+
+ def handle_end_element(self, name):
+ self.state = self.state_stack.pop()
+ self._cur_object = self._cur_object_stack.pop()
+
+
+def parse_dbus_xml(xml_data):
+ parser = DBusXMLParser(xml_data, True)
+ return parser.parsed_interfaces
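Note on how ``DBusXMLParser`` attaches documentation: a comment handler remembers the symbol named on the first line of the most recent comment, and the start-element handler copies the comment body onto an element whose ``name`` attribute matches that symbol. The sketch below shows the same pairing with ``xml.parsers.expat`` in a much reduced form (interfaces only, no ``@param`` handling). ``list_interfaces`` and the example XML are hypothetical and are not part of the patch.

```python
import xml.parsers.expat


def list_interfaces(xml_data):
    """Return (interface name, doc comment body) pairs in document order."""
    found = []
    last_comment = {"symbol": "", "body": ""}

    def on_comment(data):
        # A doc comment starts with "symbol:"; the remaining lines are its body.
        lines = [ln.strip() for ln in data.strip().splitlines()]
        if lines and lines[0].endswith(":"):
            last_comment["symbol"] = lines[0][:-1]
            last_comment["body"] = " ".join(ln for ln in lines[1:] if ln)

    def on_start(name, attrs):
        if name == "interface":
            doc = ""
            # Only use the comment if it names this element.
            if last_comment["symbol"] == attrs.get("name"):
                doc = last_comment["body"]
            found.append((attrs["name"], doc))

    parser = xml.parsers.expat.ParserCreate()
    parser.CommentHandler = on_comment
    parser.StartElementHandler = on_start
    parser.Parse(xml_data, True)
    return found


EXAMPLE_XML = b"""<node>
  <!--
    org.example.Demo:

    A hypothetical interface used only for this sketch.
  -->
  <interface name="org.example.Demo">
    <method name="Ping"/>
  </interface>
</node>"""

INTERFACES = list_interfaces(EXAMPLE_XML)
```

Matching on the symbol name, rather than on position alone, means an unrelated comment placed before an element does not end up as its documentation.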
diff --git a/docs/sphinx/depfile.py b/docs/sphinx/depfile.py
index 277fdf0f56..afdcbcec6e 100644
--- a/docs/sphinx/depfile.py
+++ b/docs/sphinx/depfile.py
@@ -12,6 +12,8 @@
import os
import sphinx
+import sys
+from pathlib import Path
__version__ = '1.0'
@@ -20,8 +22,21 @@ def get_infiles(env):
yield env.doc2path(x)
yield from ((os.path.join(env.srcdir, dep)
for dep in env.dependencies[x]))
+ for mod in sys.modules.values():
+ if hasattr(mod, '__file__'):
+ if mod.__file__:
+ yield mod.__file__
+ # this is perhaps going to include unused files:
+ for static_path in env.config.html_static_path + env.config.templates_path:
+ for path in Path(static_path).rglob('*'):
+ yield str(path)
-def write_depfile(app, env):
+
+def write_depfile(app, exception):
+ if exception:
+ return
+
+ env = app.env
if not env.config.depfile:
return
@@ -42,7 +57,7 @@ def write_depfile(app, env):
def setup(app):
app.add_config_value('depfile', None, 'env')
app.add_config_value('depfile_stamp', None, 'env')
- app.connect('env-updated', write_depfile)
+ app.connect('build-finished', write_depfile)
return dict(
version = __version__,
diff --git a/docs/sphinx/fakedbusdoc.py b/docs/sphinx/fakedbusdoc.py
new file mode 100644
index 0000000000..2d2e6ef640
--- /dev/null
+++ b/docs/sphinx/fakedbusdoc.py
@@ -0,0 +1,30 @@
+# D-Bus XML documentation extension, compatibility gunk for <sphinx4
+#
+# Copyright (C) 2021, Red Hat Inc.
+#
+# SPDX-License-Identifier: LGPL-2.1-or-later
+#
+# Author: Marc-André Lureau <marcandre.lureau@redhat.com>
+"""dbus-doc is a Sphinx extension that provides documentation from D-Bus XML."""
+
+from docutils.parsers.rst import Directive
+from sphinx.application import Sphinx
+from typing import Any, Dict
+
+
+class FakeDBusDocDirective(Directive):
+ has_content = True
+ required_arguments = 1
+
+ def run(self):
+ return []
+
+
+def setup(app: Sphinx) -> Dict[str, Any]:
+ """Register a fake dbus-doc directive with Sphinx"""
+ app.add_directive("dbus-doc", FakeDBusDocDirective)
+
+ return dict(
+ parallel_read_safe = True,
+ parallel_write_safe = True
+ )
diff --git a/docs/sphinx/hxtool.py b/docs/sphinx/hxtool.py
index fb0649a3d5..3729084a36 100644
--- a/docs/sphinx/hxtool.py
+++ b/docs/sphinx/hxtool.py
@@ -49,7 +49,7 @@ def serror(file, lnum, errtext):
def parse_directive(line):
"""Return first word of line, if any"""
- return re.split('\W', line)[0]
+ return re.split(r'\W', line)[0]
def parse_defheading(file, lnum, line):
"""Handle a DEFHEADING directive"""
@@ -78,6 +78,14 @@ def parse_archheading(file, lnum, line):
serror(file, lnum, "Invalid ARCHHEADING line")
return match.group(1)
+def parse_srst(file, lnum, line):
+ """Handle an SRST directive"""
+ # The input should be either "SRST", or "SRST(label)".
+ match = re.match(r'SRST(\((.*?)\))?', line)
+ if match is None:
+ serror(file, lnum, "Invalid SRST line")
+ return match.group(2)
+
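
The SRST label handling added in this hunk can be tried in isolation. The following is a minimal sketch, not part of ``hxtool.py``: the helper names ``srst_label`` and ``ref_target`` and the sample document and file names are assumptions made for illustration, while the regular expression and the ``_DOCNAME-HXNAME-LABEL`` layout are taken from the patch.

```python
import re

# Same pattern as parse_srst() in the patch. The helpers below are
# hypothetical stand-alone functions, not part of hxtool.py.
SRST_RE = re.compile(r'SRST(\((.*?)\))?')


def srst_label(line):
    """Return the label from "SRST(label)", or None for a bare "SRST"."""
    match = SRST_RE.match(line)
    if match is None:
        raise ValueError("Invalid SRST line: {!r}".format(line))
    # Group 2 is the text between the parentheses, if any.
    return match.group(2)


def ref_target(docname, hx_stem, label):
    """Build the reST label line in the _DOCNAME-HXNAME-LABEL form."""
    return ".. _" + docname + "-" + hx_stem + "-" + label + ":"
```

A bare ``SRST`` line yields no label, so no reference target is emitted for it; only ``SRST(label)`` lines produce one.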
class HxtoolDocDirective(Directive):
"""Extract rST fragments from the specified .hx file"""
required_argument = 1
@@ -113,6 +121,14 @@ class HxtoolDocDirective(Directive):
serror(hxfile, lnum, 'expected ERST, found SRST')
else:
state = HxState.RST
+ label = parse_srst(hxfile, lnum, line)
+ if label:
+ rstlist.append("", hxfile, lnum - 1)
+ # Build label as _DOCNAME-HXNAME-LABEL
+ hx = os.path.splitext(os.path.basename(hxfile))[0]
+ refline = ".. _" + env.docname + "-" + hx + \
+ "-" + label + ":"
+ rstlist.append(refline, hxfile, lnum - 1)
elif directive == 'ERST':
if state == HxState.CTEXT:
serror(hxfile, lnum, 'expected SRST, found ERST')
diff --git a/docs/sphinx/kerneldoc.py b/docs/sphinx/kerneldoc.py
index bf44215016..72c403a737 100644
--- a/docs/sphinx/kerneldoc.py
+++ b/docs/sphinx/kerneldoc.py
@@ -74,6 +74,10 @@ class KernelDocDirective(Directive):
# Sphinx versions
cmd += ['-sphinx-version', sphinx.__version__]
+ # Pass through the warnings-as-errors flag
+ if env.config.kerneldoc_werror:
+ cmd += ['-Werror']
+
filename = env.config.kerneldoc_srctree + '/' + self.arguments[0]
export_file_patterns = []
@@ -167,6 +171,7 @@ def setup(app):
app.add_config_value('kerneldoc_bin', None, 'env')
app.add_config_value('kerneldoc_srctree', None, 'env')
app.add_config_value('kerneldoc_verbosity', 1, 'env')
+ app.add_config_value('kerneldoc_werror', 0, 'env')
app.add_directive('kernel-doc', KernelDocDirective)
diff --git a/docs/sphinx/qapidoc.py b/docs/sphinx/qapidoc.py
index d791b59492..8d428c64b0 100644
--- a/docs/sphinx/qapidoc.py
+++ b/docs/sphinx/qapidoc.py
@@ -168,12 +168,6 @@ class QAPISchemaGenRSTVisitor(QAPISchemaVisitor):
# TODO drop fallbacks when undocumented members are outlawed
if section.text:
defn = section.text
- elif (variants and variants.tag_member == section.member
- and not section.member.type.doc_type()):
- values = section.member.type.member_names()
- defn = [nodes.Text('One of ')]
- defn.extend(intersperse([nodes.literal('', v) for v in values],
- nodes.Text(', ')))
else:
defn = [nodes.Text('Not documented')]
@@ -186,17 +180,13 @@ class QAPISchemaGenRSTVisitor(QAPISchemaVisitor):
if variants:
for v in variants.variants:
- if v.type.is_implicit():
- assert not v.type.base and not v.type.variants
- for m in v.type.local_members:
- term = self._nodes_for_one_member(m)
- term.extend(self._nodes_for_variant_when(variants, v))
- dlnode += self._make_dlitem(term, None)
- else:
- term = [nodes.Text('The members of '),
- nodes.literal('', v.type.doc_type())]
- term.extend(self._nodes_for_variant_when(variants, v))
- dlnode += self._make_dlitem(term, None)
+ if v.type.name == 'q_empty':
+ continue
+ assert not v.type.is_implicit()
+ term = [nodes.Text('The members of '),
+ nodes.literal('', v.type.doc_type())]
+ term.extend(self._nodes_for_variant_when(variants, v))
+ dlnode += self._make_dlitem(term, None)
if not dlnode.children:
return []
@@ -249,8 +239,8 @@ class QAPISchemaGenRSTVisitor(QAPISchemaVisitor):
seen_item = False
dlnode = nodes.definition_list()
for section in doc.features.values():
- dlnode += self._make_dlitem([nodes.literal('', section.name)],
- section.text)
+ dlnode += self._make_dlitem(
+ [nodes.literal('', section.member.name)], section.text)
seen_item = True
if not seen_item:
@@ -268,8 +258,11 @@ class QAPISchemaGenRSTVisitor(QAPISchemaVisitor):
"""Return list of doctree nodes for additional sections"""
nodelist = []
for section in doc.sections:
- snode = self._make_section(section.name)
- if section.name and section.name.startswith('Example'):
+ if section.tag and section.tag == 'TODO':
+ # Hide TODO: sections
+ continue
+ snode = self._make_section(section.tag)
+ if section.tag and section.tag.startswith('Example'):
snode += self._nodes_for_example(section.text)
else:
self._parse_text_into_node(section.text, snode)
@@ -512,7 +505,7 @@ class QAPIDocDirective(Directive):
except QAPIError as err:
# Launder QAPI parse errors into Sphinx extension errors
# so they are displayed nicely to the user
- raise ExtensionError(str(err))
+ raise ExtensionError(str(err)) from err
def do_parse(self, rstlist, node):
"""Parse rST source lines and add them to the specified node
diff --git a/docs/sphinx/qmp_lexer.py b/docs/sphinx/qmp_lexer.py
index f7e4c0e198..a59de8a079 100644
--- a/docs/sphinx/qmp_lexer.py
+++ b/docs/sphinx/qmp_lexer.py
@@ -41,3 +41,8 @@ def setup(sphinx):
sphinx.add_lexer('QMP', QMPExampleLexer)
except errors.VersionRequirementError:
sphinx.add_lexer('QMP', QMPExampleLexer())
+
+ return dict(
+ parallel_read_safe = True,
+ parallel_write_safe = True
+ )
diff --git a/docs/system/arm/aspeed.rst b/docs/system/arm/aspeed.rst
index cec87e3743..b2dea54eed 100644
--- a/docs/system/arm/aspeed.rst
+++ b/docs/system/arm/aspeed.rst
@@ -14,6 +14,7 @@ AST2400 SoC based machines :
- ``palmetto-bmc`` OpenPOWER Palmetto POWER8 BMC
- ``quanta-q71l-bmc`` OpenBMC Quanta BMC
+- ``supermicrox11-bmc`` Supermicro X11 BMC
AST2500 SoC based machines :
@@ -21,12 +22,21 @@ AST2500 SoC based machines :
- ``romulus-bmc`` OpenPOWER Romulus POWER9 BMC
- ``witherspoon-bmc`` OpenPOWER Witherspoon POWER9 BMC
- ``sonorapass-bmc`` OCP SonoraPass BMC
-- ``swift-bmc`` OpenPOWER Swift BMC POWER9
+- ``fp5280g2-bmc`` Inspur FP5280G2 BMC
+- ``g220a-bmc`` Bytedance G220A BMC
+- ``yosemitev2-bmc`` Facebook YosemiteV2 BMC
+- ``tiogapass-bmc`` Facebook Tiogapass BMC
AST2600 SoC based machines :
- ``ast2600-evb`` Aspeed AST2600 Evaluation board (Cortex-A7)
- ``tacoma-bmc`` OpenPOWER Witherspoon POWER9 AST2600 BMC
+- ``rainier-bmc`` IBM Rainier POWER10 BMC
+- ``fuji-bmc`` Facebook Fuji BMC
+- ``bletchley-bmc`` Facebook Bletchley BMC
+- ``fby35-bmc`` Facebook fby35 BMC
+- ``qcom-dc-scm-v1-bmc`` Qualcomm DC-SCM V1 BMC
+- ``qcom-firework-bmc`` Qualcomm Firework BMC
Supported devices
-----------------
@@ -35,7 +45,7 @@ Supported devices
* Interrupt Controller (VIC)
* Timer Controller
* RTC Controller
- * I2C Controller
+ * I2C Controller, including the new register interface of the AST2600
* System Control Unit (SCU)
* SRAM mapping
* X-DMA Controller (basic interface)
@@ -51,39 +61,50 @@ Supported devices
* Front LEDs (PCA9552 on I2C bus)
* LPC Peripheral Controller (a subset of subdevices are supported)
* Hash/Crypto Engine (HACE) - Hash support only. TODO: HMAC and RSA
+ * ADC
+ * Secure Boot Controller (AST2600)
+ * eMMC Boot Controller (dummy)
+ * PECI Controller (minimal)
+ * I3C Controller
Missing devices
---------------
* Coprocessor support
- * ADC (out of tree implementation)
* PWM and Fan Controller
* Slave GPIO Controller
* Super I/O Controller
* PCI-Express 1 Controller
* Graphic Display Controller
- * PECI Controller
* MCTP Controller
* Mailbox Controller
* Virtual UART
* eSPI Controller
- * I3C Controller
Boot options
------------
-The Aspeed machines can be started using the ``-kernel`` option to
-load a Linux kernel or from a firmware. Images can be downloaded from
-the OpenBMC jenkins :
+The Aspeed machines can be started using the ``-kernel`` and ``-dtb`` options
+to load a Linux kernel or from a firmware. Images can be downloaded from the
+OpenBMC jenkins :
- https://jenkins.openbmc.org/job/ci-openbmc/lastSuccessfulBuild/distro=ubuntu,label=docker-builder
+ https://jenkins.openbmc.org/job/ci-openbmc/lastSuccessfulBuild/
or directly from the OpenBMC GitHub release repository :
https://github.com/openbmc/openbmc/releases
-The image should be attached as an MTD drive. Run :
+To boot a kernel directly from a Linux build tree:
+
+.. code-block:: bash
+
+ $ qemu-system-arm -M ast2600-evb -nographic \
+ -kernel arch/arm/boot/zImage \
+ -dtb arch/arm/boot/dts/aspeed-ast2600-evb.dtb \
+ -initrd rootfs.cpio
+
+To boot the machine from the flash image, use an MTD drive :
.. code-block:: bash
@@ -96,14 +117,158 @@ Options specific to Aspeed machines are :
device by using the FMC controller to load the instructions, and
not simply from RAM. This takes a little longer.
- * ``fmc-model`` to change the FMC Flash model. FW needs support for
- the chip model to boot.
+ * ``fmc-model`` to change the default FMC Flash model. FW needs
+ support for the chip model to boot.
+
+ * ``spi-model`` to change the default SPI Flash model.
- * ``spi-model`` to change the SPI Flash model.
+ * ``bmc-console`` to change the default console device. Most of the
+ machines use the ``UART5`` device for a boot console, which is
+ mapped on ``/dev/ttyS4`` under Linux, but it is not always the
+ case.
-For instance, to start the ``ast2500-evb`` machine with a different
-FMC chip and a bigger (64M) SPI chip, use :
+To use other flash models, for instance a different FMC chip and a
+bigger (64M) SPI for the ``ast2500-evb`` machine, run :
.. code-block:: bash
-M ast2500-evb,fmc-model=mx25l25635e,spi-model=mx66u51235f
+
+When more flexibility is needed to define the flash devices, to use
+different flash models or define all flash devices (up to 8), the
+``-nodefaults`` QEMU option can be used to avoid creating the default
+flash devices.
+
+Flash devices should then be created from the command line and attached
+to a block device :
+
+.. code-block:: bash
+
+ $ qemu-system-arm -M ast2600-evb \
+ -blockdev node-name=fmc0,driver=file,filename=/path/to/fmc0.img \
+ -device mx66u51235f,bus=ssi.0,cs=0x0,drive=fmc0 \
+ -blockdev node-name=fmc1,driver=file,filename=/path/to/fmc1.img \
+ -device mx66u51235f,bus=ssi.0,cs=0x1,drive=fmc1 \
+ -blockdev node-name=spi1,driver=file,filename=/path/to/spi1.img \
+ -device mx66u51235f,cs=0x0,bus=ssi.1,drive=spi1 \
+ -nographic -nodefaults
+
+In that case, the machine boots fetching instructions from the FMC0
+device. It is slower to start but closer to what HW does. Using the
+machine option ``execute-in-place`` has a similar effect.
+
+To change the boot console and use device ``UART3`` (``/dev/ttyS2``
+under Linux), use :
+
+.. code-block:: bash
+
+ -M ast2500-evb,bmc-console=uart3
+
+Aspeed minibmc family boards (``ast1030-evb``)
+==================================================================
+
+The QEMU Aspeed machines model mini BMCs of various Aspeed evaluation
+boards. They are based on different releases of the
+Aspeed SoC : the AST1030 integrating an ARM Cortex M4F CPU (200MHz).
+
+The SoC comes with SRAM, SPI, I2C, etc.
+
+AST1030 SoC based machines :
+
+- ``ast1030-evb`` Aspeed AST1030 Evaluation board (Cortex-M4F)
+
+Supported devices
+-----------------
+
+ * SMP (for the AST1030 Cortex-M4F)
+ * Interrupt Controller (VIC)
+ * Timer Controller
+ * I2C Controller
+ * System Control Unit (SCU)
+ * SRAM mapping
+ * Static Memory Controller (SMC or FMC) - Only SPI Flash support
+ * SPI Memory Controller
+ * USB 2.0 Controller
+ * Watchdog Controller
+ * GPIO Controller (Master only)
+ * UART
+ * LPC Peripheral Controller (a subset of subdevices are supported)
+ * Hash/Crypto Engine (HACE) - Hash support only. TODO: HMAC and RSA
+ * ADC
+ * Secure Boot Controller
+ * PECI Controller (minimal)
+
+
+Missing devices
+---------------
+
+ * PWM and Fan Controller
+ * Slave GPIO Controller
+ * Mailbox Controller
+ * Virtual UART
+ * eSPI Controller
+ * I3C Controller
+
+Boot options
+------------
+
+The Aspeed machines can be started using the ``-kernel`` option to load a
+Zephyr OS or from a firmware. Images can be downloaded from the
+ASPEED GitHub release repository :
+
+ https://github.com/AspeedTech-BMC/zephyr/releases
+
+To boot a kernel directly from a Zephyr build tree:
+
+.. code-block:: bash
+
+ $ qemu-system-arm -M ast1030-evb -nographic \
+ -kernel zephyr.elf
+
+Facebook Yosemite v3.5 Platform and CraterLake Server (``fby35``)
+==================================================================
+
+Facebook has a series of multi-node compute server designs named
+Yosemite. The most recent version released was
+`Yosemite v3 <https://www.opencompute.org/documents/ocp-yosemite-v3-platform-design-specification-1v16-pdf>`__.
+
+Yosemite v3.5 is an iteration on this design, and is very similar: there's a
+baseboard with a BMC, and 4 server slots. The new server board design termed
+"CraterLake" includes a Bridge IC (BIC), with room for expansion boards to
+include various compute accelerators (video, inferencing, etc). At the moment,
+only the first server slot's BIC is included.
+
+Yosemite v3.5 is itself a sled which fits into a 40U chassis, and 3 sleds
+can be fit into a chassis. See `here <https://www.opencompute.org/products/423/wiwynn-yosemite-v3-server>`__
+for an example.
+
+In this generation, the BMC is an AST2600 and each BIC is an AST1030. The BMC
+runs `OpenBMC <https://github.com/facebook/openbmc>`__, and the BIC runs
+`OpenBIC <https://github.com/facebook/openbic>`__.
+
+Firmware images can be retrieved from the GitHub releases or built from the
+source code; see the READMEs for instructions. This image uses the
+"fby35" machine recipe from OpenBMC, and the "yv35-cl" target from OpenBIC.
+Some reference images can also be found here:
+
+.. code-block:: bash
+
+ $ wget https://github.com/facebook/openbmc/releases/download/openbmc-e2294ff5d31d/fby35.mtd
+ $ wget https://github.com/peterdelevoryas/OpenBIC/releases/download/oby35-cl-2022.13.01/Y35BCL.elf
+
+Since this machine has multiple SoCs, each with their own serial console, the
+recommended way to run it is to allocate a pseudoterminal for each serial
+console and let the monitor use stdio. Also, starting in a paused state is
+useful because it allows you to attach to the pseudoterminals before the boot
+process starts.
+
+.. code-block:: bash
+
+ $ qemu-system-arm -machine fby35 \
+ -drive file=fby35.mtd,format=raw,if=mtd \
+ -device loader,file=Y35BCL.elf,addr=0,cpu-num=2 \
+ -serial pty -serial pty -serial mon:stdio \
+ -display none -S
+ $ screen /dev/tty0 # In a separate TMUX pane, terminal window, etc.
+ $ screen /dev/tty1
+ $ (qemu) c # Start the boot process once screen is setup.
diff --git a/docs/system/arm/b-l475e-iot01a.rst b/docs/system/arm/b-l475e-iot01a.rst
new file mode 100644
index 0000000000..a76c9976c5
--- /dev/null
+++ b/docs/system/arm/b-l475e-iot01a.rst
@@ -0,0 +1,45 @@
+B-L475E-IOT01A IoT Node (``b-l475e-iot01a``)
+============================================
+
+The B-L475E-IOT01A IoT Node uses the STM32L475VG SoC which is based on an
+ARM Cortex-M4F core. It is part of STMicroelectronics
+:doc:`STM32 boards </system/arm/stm32>` and more specifically the STM32L4
+ultra-low power series. The STM32L4x5 chip runs at up to 80 MHz and
+integrates 128 KiB of SRAM and up to 1MiB of Flash. The B-L475E-IOT01A board
+notably features 64 Mibit QSPI Flash, BT, WiFi and RF connectivity,
+USART, I2C, SPI, CAN and USB OTG, as well as a variety of sensors.
+
+Supported devices
+"""""""""""""""""
+
+Currently the B-L475E-IOT01A machine only supports the following devices:
+
+- Cortex-M4F based STM32L4x5 SoC
+- STM32L4x5 EXTI (Extended interrupts and events controller)
+- STM32L4x5 SYSCFG (System configuration controller)
+- STM32L4x5 RCC (Reset and clock control)
+- STM32L4x5 GPIOs (General-purpose I/Os)
+- STM32L4x5 USARTs, UARTs and LPUART (Serial ports)
+
+Missing devices
+"""""""""""""""
+
+The B-L475E-IOT01A does *not* support the following devices:
+
+- Analog to Digital Converter (ADC)
+- SPI controller
+- Timer controller (TIMER)
+
+See the complete list of unimplemented peripheral devices
+in the STM32L4x5 module : ``./hw/arm/stm32l4x5_soc.c``
+
+Boot options
+""""""""""""
+
+The B-L475E-IOT01A machine can be started using the ``-kernel``
+option to load a firmware. Example:
+
+.. code-block:: bash
+
+ $ qemu-system-arm -M b-l475e-iot01a -kernel firmware.bin
+
diff --git a/docs/system/arm/bananapi_m2u.rst b/docs/system/arm/bananapi_m2u.rst
new file mode 100644
index 0000000000..587b488655
--- /dev/null
+++ b/docs/system/arm/bananapi_m2u.rst
@@ -0,0 +1,140 @@
+Banana Pi BPI-M2U (``bpim2u``)
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Banana Pi BPI-M2 Ultra is a quad-core mini single board computer built with
+Allwinner A40i/R40/V40 SoC. It features 2GB of RAM and 8GB eMMC. It also
+has onboard WiFi and BT. On the ports side, the BPI-M2 Ultra has 2 USB A
+2.0 ports, 1 USB OTG port, 1 HDMI port, 1 audio jack, a DC power port,
+and last but not least, a SATA port.
+
+Supported devices
+"""""""""""""""""
+
+The Banana Pi M2U machine supports the following devices:
+
+ * SMP (Quad Core Cortex-A7)
+ * Generic Interrupt Controller configuration
+ * SRAM mappings
+ * SDRAM controller
+ * Timer device (re-used from Allwinner A10)
+ * UART
+ * SD/MMC storage controller
+ * EMAC ethernet
+ * GMAC ethernet
+ * Clock Control Unit
+ * SATA
+ * TWI (I2C)
+ * USB 2.0
+ * Hardware Watchdog
+
+Limitations
+"""""""""""
+
+Currently, Banana Pi M2U does *not* support the following features:
+
+- Graphical output via HDMI, GPU and/or the Display Engine
+- Audio output
+- Real Time Clock
+
+Also see the 'unimplemented' array in the Allwinner R40 SoC module
+for a complete list of unimplemented I/O devices: ``./hw/arm/allwinner-r40.c``
+
+Boot options
+""""""""""""
+
+The Banana Pi M2U machine can start using the standard -kernel functionality
+for loading a Linux kernel or ELF executable. Additionally, the Banana Pi M2U
+machine can also emulate the BootROM which is present on an actual Allwinner R40
+based SoC, which loads the bootloader from a SD card, specified via the -sd
+argument to qemu-system-arm.
+
+Running mainline Linux
+""""""""""""""""""""""
+
+To build a Linux mainline kernel that can be booted by the Banana Pi M2U machine,
+simply configure the kernel using the sunxi_defconfig configuration:
+
+.. code-block:: bash
+
+ $ ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- make mrproper
+ $ ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- make sunxi_defconfig
+
+To boot the newly built Linux kernel in QEMU with the Banana Pi M2U machine, use:
+
+.. code-block:: bash
+
+ $ qemu-system-arm -M bpim2u -nographic \
+ -kernel /path/to/linux/arch/arm/boot/zImage \
+ -append 'console=ttyS0,115200' \
+ -dtb /path/to/linux/arch/arm/boot/dts/sun8i-r40-bananapi-m2-ultra.dtb
+
+Banana Pi M2U images
+""""""""""""""""""""
+
+Note that the mainline kernel does not have a root filesystem. You can choose
+to build your own image with buildroot using the bananapi_m2_ultra_defconfig.
+Also see https://buildroot.org for more information.
+
+Another possibility is to run an OpenWrt image for Banana Pi M2U which
+can be downloaded from:
+
+ https://downloads.openwrt.org/releases/22.03.3/targets/sunxi/cortexa7/
+
+When using an image as an SD card, it must be resized to a power of two. This can be
+done with the ``qemu-img`` command. It is recommended to only increase the image size
+instead of shrinking it to a power of two, to avoid loss of data. For example,
+to prepare a downloaded OpenWrt image, first extract it and then increase
+its size to one gigabyte as follows:
+
+.. code-block:: bash
+
+ $ qemu-img resize \
+ openwrt-22.03.3-sunxi-cortexa7-sinovoip_bananapi-m2-ultra-ext4-sdcard.img \
+ 1G
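
The power-of-two requirement above can be checked before calling ``qemu-img``. The following is a minimal sketch, assuming the image size in bytes is already known (for example from ``os.path.getsize``); the helper names are hypothetical and not part of QEMU. It only rounds up, in line with the advice to grow the image rather than shrink it.

```python
def next_power_of_two(size_bytes):
    """Return the smallest power of two that is >= size_bytes."""
    if size_bytes <= 0:
        raise ValueError("size must be positive")
    # (n - 1).bit_length() is the exponent of the next power of two.
    return 1 << (size_bytes - 1).bit_length()


def qemu_img_resize_args(image_path, current_size):
    """Build a 'qemu-img resize' argument list (hypothetical helper).

    The target size is passed as a plain byte count and is never
    smaller than the current size.
    """
    target = next_power_of_two(current_size)
    return ["qemu-img", "resize", image_path, str(target)]
```

The returned list can be handed to ``subprocess.run`` once the target size has been reviewed.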
+
+Instead of providing a custom Linux kernel via the -kernel command you may also
+choose to let the Banana Pi M2U machine load the bootloader from SD card, just like
+a real board would do using the BootROM. Simply pass the selected image via the -sd
+argument and remove the -kernel, -append, -dtb and -initrd arguments:
+
+.. code-block:: bash
+
+ $ qemu-system-arm -M bpim2u -nic user -nographic \
+ -sd openwrt-22.03.3-sunxi-cortexa7-sinovoip_bananapi-m2-ultra-ext4-sdcard.img
+
+Running U-Boot
+""""""""""""""
+
+U-Boot mainline can be built and configured using the Bananapi_M2_Ultra_defconfig
+using similar commands as described above for Linux. Note that it is recommended
+for development/testing to select the following configuration setting in U-Boot:
+
+ Device Tree Control > Provider for DTB for DT Control > Embedded DTB
+
+The BootROM of the Allwinner R40 loads U-Boot from the 8 KiB offset of the SD card.
+Let's create a bootable disk image:
+
+.. code-block:: bash
+
+ $ dd if=/dev/zero of=sd.img bs=32M count=1
+ $ dd if=u-boot-sunxi-with-spl.bin of=sd.img bs=1k seek=8 conv=notrunc
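
For scripted setups, the two ``dd`` steps above can be approximated in Python. This is a minimal sketch under stated assumptions: the function name ``make_sd_image`` is hypothetical, the 32 MiB size and 8 KiB offset are taken from the commands above, and the image is assumed to be zero-filled when extended, which holds on most platforms.

```python
SD_IMAGE_SIZE = 32 * 1024 * 1024   # same as: dd bs=32M count=1
BOOTLOADER_OFFSET = 8 * 1024       # same as: dd bs=1k seek=8


def make_sd_image(image_path, bootloader_path,
                  size=SD_IMAGE_SIZE, offset=BOOTLOADER_OFFSET):
    """Create a blank image and place the bootloader at the given offset."""
    with open(bootloader_path, "rb") as f:
        payload = f.read()
    if offset + len(payload) > size:
        raise ValueError("bootloader does not fit in the image")
    with open(image_path, "wb") as img:
        # Extend the empty file to the full image size.
        img.truncate(size)
        # Write the bootloader without changing the overall size,
        # like dd's conv=notrunc.
        img.seek(offset)
        img.write(payload)
```

The resulting file can be passed to ``-sd`` in the same way as the image produced by ``dd``.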
+
+And then boot it.
+
+.. code-block:: bash
+
+ $ qemu-system-arm -M bpim2u -nographic -sd sd.img
+
+Banana Pi M2U integration tests
+"""""""""""""""""""""""""""""""
+
+The Banana Pi M2U machine has several integration tests included.
+To run the whole set of tests, build QEMU from source and simply
+provide the following command:
+
+.. code-block:: bash
+
+ $ cd qemu-build-dir
+ $ AVOCADO_ALLOW_LARGE_STORAGE=yes tests/venv/bin/avocado \
+ --verbose --show=app,console run -t machine:bpim2u \
+ ../tests/avocado/boot_linux_console.py
diff --git a/docs/system/arm/cpu-features.rst b/docs/system/arm/cpu-features.rst
index 584eb17097..a5fb929243 100644
--- a/docs/system/arm/cpu-features.rst
+++ b/docs/system/arm/cpu-features.rst
@@ -177,39 +177,32 @@ are named with the prefix "kvm-". KVM VCPU features may be probed,
enabled, and disabled in the same way as other CPU features. Below is
the list of KVM VCPU features and their descriptions.
- kvm-no-adjvtime By default kvm-no-adjvtime is disabled. This
- means that by default the virtual time
- adjustment is enabled (vtime is not *not*
- adjusted).
-
- When virtual time adjustment is enabled each
- time the VM transitions back to running state
- the VCPU's virtual counter is updated to ensure
- stopped time is not counted. This avoids time
- jumps surprising guest OSes and applications,
- as long as they use the virtual counter for
- timekeeping. However it has the side effect of
- the virtual and physical counters diverging.
- All timekeeping based on the virtual counter
- will appear to lag behind any timekeeping that
- does not subtract VM stopped time. The guest
- may resynchronize its virtual counter with
- other time sources as needed.
-
- Enable kvm-no-adjvtime to disable virtual time
- adjustment, also restoring the legacy (pre-5.0)
- behavior.
-
- kvm-steal-time Since v5.2, kvm-steal-time is enabled by
- default when KVM is enabled, the feature is
- supported, and the guest is 64-bit.
-
- When kvm-steal-time is enabled a 64-bit guest
- can account for time its CPUs were not running
- due to the host not scheduling the corresponding
- VCPU threads. The accounting statistics may
- influence the guest scheduler behavior and/or be
- exposed to the guest userspace.
+``kvm-no-adjvtime``
+ By default kvm-no-adjvtime is disabled. This means that by default
+ the virtual time adjustment is enabled (that is, vtime *is* adjusted).
+
+ When virtual time adjustment is enabled each time the VM transitions
+ back to running state the VCPU's virtual counter is updated to
+ ensure stopped time is not counted. This avoids time jumps
+ surprising guest OSes and applications, as long as they use the
+ virtual counter for timekeeping. However it has the side effect of
+ the virtual and physical counters diverging. All timekeeping based
+ on the virtual counter will appear to lag behind any timekeeping
+ that does not subtract VM stopped time. The guest may resynchronize
+ its virtual counter with other time sources as needed.
+
+ Enable kvm-no-adjvtime to disable virtual time adjustment, also
+ restoring the legacy (pre-5.0) behavior.
+
+``kvm-steal-time``
+ Since v5.2, kvm-steal-time is enabled by default when KVM is
+ enabled, the feature is supported, and the guest is 64-bit.
+
+ When kvm-steal-time is enabled a 64-bit guest can account for time
+ its CPUs were not running due to the host not scheduling the
+ corresponding VCPU threads. The accounting statistics may influence
+ the guest scheduler behavior and/or be exposed to the guest
+ userspace.
TCG VCPU Features
=================
@@ -217,20 +210,20 @@ TCG VCPU Features
TCG VCPU features are CPU features that are specific to TCG.
Below is the list of TCG VCPU features and their descriptions.
- pauth Enable or disable ``FEAT_Pauth``, pointer
- authentication. By default, the feature is
- enabled with ``-cpu max``.
+``pauth``
+ Enable or disable ``FEAT_Pauth`` entirely.
- pauth-impdef When ``FEAT_Pauth`` is enabled, either the
- *impdef* (Implementation Defined) algorithm
- is enabled or the *architected* QARMA algorithm
- is enabled. By default the impdef algorithm
- is disabled, and QARMA is enabled.
+``pauth-impdef``
+ When ``pauth`` is enabled, select the QEMU implementation defined algorithm.
- The architected QARMA algorithm has good
- cryptographic properties, but can be quite slow
- to emulate. The impdef algorithm used by QEMU
- is non-cryptographic but significantly faster.
+``pauth-qarma3``
+ When ``pauth`` is enabled, select the architected QARMA3 algorithm.
+
+Without either ``pauth-impdef`` or ``pauth-qarma3`` enabled,
+the architected QARMA5 algorithm is used. The architected QARMA5
+and QARMA3 algorithms have good cryptographic properties, but can
+be quite slow to emulate. The impdef algorithm used by QEMU is
+non-cryptographic but significantly faster.
SVE CPU Properties
==================
@@ -288,7 +281,7 @@ SVE CPU Property Parsing Semantics
CPU Property Dependencies and Constraints").
4) If one or more vector lengths have been explicitly enabled and at
- at least one of the dependency lengths of the maximum enabled length
+ least one of the dependency lengths of the maximum enabled length
has been explicitly disabled, then an error is generated (see
constraint (2) of "SVE CPU Property Dependencies and Constraints").
@@ -376,6 +369,31 @@ verbose command lines. However, the recommended way to select vector
lengths is to explicitly enable each desired length. Therefore only
example's (1), (4), and (6) exhibit recommended uses of the properties.
+SME CPU Property Examples
+-------------------------
+
+ 1) Disable SME::
+
+ $ qemu-system-aarch64 -M virt -cpu max,sme=off
+
+ 2) Implicitly enable all vector lengths for the ``max`` CPU type::
+
+ $ qemu-system-aarch64 -M virt -cpu max
+
+ 3) Only enable the 256-bit vector length::
+
+ $ qemu-system-aarch64 -M virt -cpu max,sme256=on
+
+ 4) Enable the 256-bit and 1024-bit vector lengths::
+
+ $ qemu-system-aarch64 -M virt -cpu max,sme256=on,sme1024=on
+
+ 5) Disable the 512-bit vector length. This results in all the other
+ lengths supported by ``max`` defaulting to enabled
+ (128, 256, 1024 and 2048)::
+
+ $ qemu-system-aarch64 -M virt -cpu max,sme512=off
+
SVE User-mode Default Vector Length Property
--------------------------------------------
@@ -391,3 +409,57 @@ length supported by QEMU is 256.
If this property is set to ``-1`` then the default vector length
is set to the maximum possible length.
+
+SME CPU Properties
+==================
+
+The SME CPU properties are much like the SVE properties: ``sme`` is
+used to enable or disable the entire SME feature, and ``sme<N>`` is
+used to enable or disable specific vector lengths. Finally,
+``sme_fa64`` is used to enable or disable ``FEAT_SME_FA64``, which
+allows execution of the "full a64" instruction set while Streaming
+SVE mode is enabled.
+
+SME is not supported by KVM at this time.
+
+At least one vector length must be enabled when ``sme`` is enabled,
+and all vector lengths must be powers of 2. The maximum vector
+length supported by QEMU is 2048 bits. Otherwise, there are no
+additional constraints on the set of vector lengths supported by SME.
+
+SME User-mode Default Vector Length Property
+--------------------------------------------
+
+For qemu-aarch64, the CPU property ``sme-default-vector-length=N`` is
+defined to mirror the Linux kernel parameter file
+``/proc/sys/abi/sme_default_vector_length``. The default length, ``N``,
+is in units of bytes and must be between 16 and 8192.
+If not specified, the default vector length is 32.
+
+As with ``sve-default-vector-length``, if the default length is larger
+than the maximum vector length enabled, the actual vector length will
+be reduced. If this property is set to ``-1`` then the default vector
+length is set to the maximum possible length.
+
+RME CPU Properties
+==================
+
+The status of RME support with QEMU is experimental. At this time we
+only support RME within the CPU proper, not within the SMMU or GIC.
+The feature is enabled by the CPU property ``x-rme``, with the ``x-``
+prefix present as a reminder of the experimental status, and defaults off.
+
+The method for enabling RME will change in some future QEMU release
+without notice or backward compatibility.
+
+RME Level 0 GPT Size Property
+-----------------------------
+
+To aid firmware developers in testing different possible CPU
+configurations, ``x-l0gptsz=S`` may be used to specify the value
+to encode into ``GPCCR_EL3.L0GPTSZ``, a read-only field that
+specifies the size of the Level 0 Granule Protection Table.
+Legal values for ``S`` are 30, 34, 36, and 39; the default is 30.
+
+As with ``x-rme``, the ``x-l0gptsz`` property may be renamed or
+removed in some future QEMU release.
diff --git a/docs/system/arm/cubieboard.rst b/docs/system/arm/cubieboard.rst
index 344ff8cef9..58c4a2d3ea 100644
--- a/docs/system/arm/cubieboard.rst
+++ b/docs/system/arm/cubieboard.rst
@@ -14,3 +14,5 @@ Emulated devices:
- SDHCI
- USB controller
- SATA controller
+- TWI (I2C) controller
+- Watchdog timer
diff --git a/docs/system/arm/emulation.rst b/docs/system/arm/emulation.rst
index 144dc491d9..a9ae7ede9f 100644
--- a/docs/system/arm/emulation.rst
+++ b/docs/system/arm/emulation.rst
@@ -1,3 +1,5 @@
+.. _Arm Emulation:
+
A-profile CPU architecture support
==================================
@@ -9,35 +11,80 @@ the following architecture extensions:
- FEAT_AA32HPD (AArch32 hierarchical permission disables)
- FEAT_AA32I8MM (AArch32 Int8 matrix multiplication instructions)
- FEAT_AES (AESD and AESE instructions)
+- FEAT_BBM at level 2 (Translation table break-before-make levels)
- FEAT_BF16 (AArch64 BFloat16 instructions)
- FEAT_BTI (Branch Target Identification)
+- FEAT_CRC32 (CRC32 instructions)
+- FEAT_CSV2 (Cache speculation variant 2)
+- FEAT_CSV2_1p1 (Cache speculation variant 2, version 1.1)
+- FEAT_CSV2_1p2 (Cache speculation variant 2, version 1.2)
+- FEAT_CSV2_2 (Cache speculation variant 2, version 2)
+- FEAT_CSV3 (Cache speculation variant 3)
+- FEAT_DGH (Data gathering hint)
- FEAT_DIT (Data Independent Timing instructions)
- FEAT_DPB (DC CVAP instruction)
+- FEAT_Debugv8p2 (Debug changes for v8.2)
+- FEAT_Debugv8p4 (Debug changes for v8.4)
- FEAT_DotProd (Advanced SIMD dot product instructions)
+- FEAT_DoubleFault (Double Fault Extension)
+- FEAT_E0PD (Preventing EL0 access to halves of address maps)
+- FEAT_ECV (Enhanced Counter Virtualization)
+- FEAT_EPAC (Enhanced pointer authentication)
+- FEAT_ETS (Enhanced Translation Synchronization)
+- FEAT_EVT (Enhanced Virtualization Traps)
- FEAT_FCMA (Floating-point complex number instructions)
+- FEAT_FGT (Fine-Grained Traps)
- FEAT_FHM (Floating-point half-precision multiplication instructions)
- FEAT_FP16 (Half-precision floating-point data processing)
+- FEAT_FPAC (Faulting on AUT* instructions)
+- FEAT_FPACCOMBINE (Faulting on combined pointer authentication instructions)
- FEAT_FRINTTS (Floating-point to integer instructions)
- FEAT_FlagM (Flag manipulation instructions v2)
- FEAT_FlagM2 (Enhancements to flag manipulation instructions)
+- FEAT_GTG (Guest translation granule size)
+- FEAT_HAFDBS (Hardware management of the access flag and dirty bit state)
+- FEAT_HBC (Hinted conditional branches)
+- FEAT_HCX (Support for the HCRX_EL2 register)
- FEAT_HPDS (Hierarchical permission disables)
+- FEAT_HPDS2 (Translation table page-based hardware attributes)
+- FEAT_HPMN0 (Setting of MDCR_EL2.HPMN to zero)
- FEAT_I8MM (AArch64 Int8 matrix multiplication instructions)
+- FEAT_IDST (ID space trap handling)
+- FEAT_IESB (Implicit error synchronization event)
- FEAT_JSCVT (JavaScript conversion instructions)
- FEAT_LOR (Limited ordering regions)
+- FEAT_LPA (Large Physical Address space)
+- FEAT_LPA2 (Large Physical and virtual Address space v2)
- FEAT_LRCPC (Load-acquire RCpc instructions)
- FEAT_LRCPC2 (Load-acquire RCpc instructions v2)
- FEAT_LSE (Large System Extensions)
+- FEAT_LSE2 (Large System Extensions v2)
+- FEAT_LVA (Large Virtual Address space)
+- FEAT_MOPS (Standardization of memory operations)
- FEAT_MTE (Memory Tagging Extension)
- FEAT_MTE2 (Memory Tagging Extension)
- FEAT_MTE3 (MTE Asymmetric Fault Handling)
+- FEAT_NMI (Non-maskable Interrupt)
+- FEAT_NV (Nested Virtualization)
+- FEAT_NV2 (Enhanced nested virtualization support)
+- FEAT_PACIMP (Pointer authentication - IMPLEMENTATION DEFINED algorithm)
+- FEAT_PACQARMA3 (Pointer authentication - QARMA3 algorithm)
+- FEAT_PACQARMA5 (Pointer authentication - QARMA5 algorithm)
- FEAT_PAN (Privileged access never)
- FEAT_PAN2 (AT S1E1R and AT S1E1W instruction variants affected by PSTATE.PAN)
+- FEAT_PAN3 (Support for SCTLR_ELx.EPAN)
- FEAT_PAuth (Pointer authentication)
+- FEAT_PAuth2 (Enhancements to pointer authentication)
- FEAT_PMULL (PMULL, PMULL2 instructions)
- FEAT_PMUv3p1 (PMU Extensions v3.1)
- FEAT_PMUv3p4 (PMU Extensions v3.4)
+- FEAT_PMUv3p5 (PMU Extensions v3.5)
+- FEAT_RAS (Reliability, availability, and serviceability)
+- FEAT_RASv1p1 (RAS Extension v1.1)
- FEAT_RDM (Advanced SIMD rounding double multiply accumulate instructions)
+- FEAT_RME (Realm Management Extension) (NB: support status in QEMU is experimental)
- FEAT_RNG (Random number generator)
+- FEAT_S2FWB (Stage 2 forced Write-Back)
- FEAT_SB (Speculation Barrier)
- FEAT_SEL2 (Secure EL2)
- FEAT_SHA1 (SHA1 instructions)
@@ -46,11 +93,17 @@ the following architecture extensions:
- FEAT_SHA512 (Advanced SIMD SHA512 instructions)
- FEAT_SM3 (Advanced SIMD SM3 instructions)
- FEAT_SM4 (Advanced SIMD SM4 instructions)
+- FEAT_SME (Scalable Matrix Extension)
+- FEAT_SME_FA64 (Full A64 instruction set in Streaming SVE mode)
+- FEAT_SME_F64F64 (Double-precision floating-point outer product instructions)
+- FEAT_SME_I16I64 (16-bit to 64-bit integer widening outer product instructions)
- FEAT_SPECRES (Speculation restriction instructions)
- FEAT_SSBS (Speculative Store Bypass Safe)
+- FEAT_TIDCP1 (EL0 use of IMPLEMENTATION DEFINED functionality)
- FEAT_TLBIOS (TLB invalidate instructions in Outer Shareable domain)
- FEAT_TLBIRANGE (TLB invalidate range instructions)
- FEAT_TTCNP (Translation table Common not private translations)
+- FEAT_TTL (Translation Table Level)
- FEAT_TTST (Small translation tables)
- FEAT_UAO (Unprivileged Access Override control)
- FEAT_VHE (Virtualization Host Extensions)
diff --git a/docs/system/arm/mps2.rst b/docs/system/arm/mps2.rst
index 8a75beb3a0..a305935cc4 100644
--- a/docs/system/arm/mps2.rst
+++ b/docs/system/arm/mps2.rst
@@ -1,7 +1,7 @@
-Arm MPS2 and MPS3 boards (``mps2-an385``, ``mps2-an386``, ``mps2-an500``, ``mps2-an505``, ``mps2-an511``, ``mps2-an521``, ``mps3-an524``, ``mps3-an547``)
-=========================================================================================================================================================
+Arm MPS2 and MPS3 boards (``mps2-an385``, ``mps2-an386``, ``mps2-an500``, ``mps2-an505``, ``mps2-an511``, ``mps2-an521``, ``mps3-an524``, ``mps3-an536``, ``mps3-an547``)
+=========================================================================================================================================================================
-These board models all use Arm M-profile CPUs.
+These board models use Arm M-profile or R-profile CPUs.
The Arm MPS2, MPS2+ and MPS3 dev boards are FPGA based (the 2+ has a
bigger FPGA but is otherwise the same as the 2; the 3 has a bigger
@@ -13,6 +13,8 @@ FPGA image.
QEMU models the following FPGA images:
+FPGA images using M-profile CPUs:
+
``mps2-an385``
Cortex-M3 as documented in Arm Application Note AN385
``mps2-an386``
@@ -30,6 +32,11 @@ QEMU models the following FPGA images:
``mps3-an547``
Cortex-M55 on an MPS3, as documented in Arm Application Note AN547
+FPGA images using R-profile CPUs:
+
+``mps3-an536``
+ Dual Cortex-R52 on an MPS3, as documented in Arm Application Note AN536
+
Differences between QEMU and real hardware:
- AN385/AN386 remapping of low 16K of memory to either ZBT SSRAM1 or to
@@ -45,6 +52,30 @@ Differences between QEMU and real hardware:
flash, but only as simple ROM, so attempting to rewrite the flash
from the guest will fail
- QEMU does not model the USB controller in MPS3 boards
+- AN536 does not support runtime control of CPU reset and halt via
+ the SCC CFG_REG0 register.
+- AN536 does not support enabling or disabling the flash and ATCM
+ interfaces via the SCC CFG_REG1 register.
+- AN536 does not support setting of the initial vector table
+ base address via the SCC CFG_REG6 and CFG_REG7 register config,
+ and does not provide a mechanism for specifying these values at
+ startup, so all guest images must be built to start from TCM
+ (i.e. to expect the interrupt vector base at 0 from reset).
+- AN536 defaults to only creating a single CPU; this is the equivalent
+ of the way the real FPGA image usually runs with the second Cortex-R52
+ held in halt via the initial SCC CFG_REG0 register setting. You can
+ create the second CPU with ``-smp 2``; both CPUs will then start
+ execution immediately on startup.
+
+Note that for the AN536 the first UART is accessible only by
+CPU0, and the second UART is accessible only by CPU1. The
+first UART shared between both CPUs is the third
+UART. Guest software might therefore be built to use either
+the first UART or the third UART; if you don't see any output
+from the UART you are looking at, try one of the others.
+(Even if the AN536 machine is started with a single CPU and so
+no "CPU1-only UART", the UART numbering remains the same,
+with the third UART being the first of the shared ones.)
Machine-specific options
""""""""""""""""""""""""
diff --git a/docs/system/arm/nuvoton.rst b/docs/system/arm/nuvoton.rst
index adf497e679..0424cae4b0 100644
--- a/docs/system/arm/nuvoton.rst
+++ b/docs/system/arm/nuvoton.rst
@@ -21,6 +21,7 @@ Hyperscale applications. The following machines are based on this chip :
- ``quanta-gbs-bmc`` Quanta GBS server BMC
- ``quanta-gsj`` Quanta GSJ server BMC
- ``kudo-bmc`` Fii USA Kudo server BMC
+- ``mori-bmc`` Fii USA Mori server BMC
There are also two more SoCs, NPCM710 and NPCM705, which are single-core
variants of NPCM750 and NPCM730, respectively. These are currently not
@@ -48,6 +49,7 @@ Supported devices
* SMBus controller (SMBF)
* Ethernet controller (EMC)
* Tachometer
+ * Peripheral SPI controller (PSPI)
Missing devices
---------------
@@ -63,7 +65,6 @@ Missing devices
* Ethernet controller (GMAC)
* USB device (USBD)
- * Peripheral SPI controller (PSPI)
* SD/MMC host
* PECI interface
* PCI and PCIe root complex and bridges
@@ -81,9 +82,9 @@ Boot options
The Nuvoton machines can boot from an OpenBMC firmware image, or directly into
a kernel using the ``-kernel`` option. OpenBMC images for ``quanta-gsj`` and
-possibly others can be downloaded from the OpenPOWER jenkins :
+possibly others can be downloaded from the OpenBMC jenkins :
- https://openpower.xyz/
+ https://jenkins.openbmc.org/
The firmware image should be attached as an MTD drive. Example :
diff --git a/docs/system/arm/orangepi.rst b/docs/system/arm/orangepi.rst
index 6f23907fb6..9afa54213b 100644
--- a/docs/system/arm/orangepi.rst
+++ b/docs/system/arm/orangepi.rst
@@ -25,6 +25,8 @@ The Orange Pi PC machine supports the following devices:
* Clock Control Unit
* System Control module
* Security Identifier device
+ * TWI (I2C)
+ * Watchdog timer
Limitations
"""""""""""
@@ -128,7 +130,7 @@ Alternatively, you can also choose to build you own image with buildroot
using the orangepi_pc_defconfig. Also see https://buildroot.org for more information.
When using an image as an SD card, it must be resized to a power of two. This can be
-done with the qemu-img command. It is recommended to only increase the image size
+done with the ``qemu-img`` command. It is recommended to only increase the image size
instead of shrinking it to a power of two, to avoid loss of data. For example,
to prepare a downloaded Armbian image, first extract it and then increase
its size to one gigabyte as follows:
@@ -250,14 +252,14 @@ and set the following environment variables before booting:
Optionally you may save the environment variables to SD card with 'saveenv'.
To continue booting simply give the 'boot' command and NetBSD boots.
-Orange Pi PC acceptance tests
-"""""""""""""""""""""""""""""
+Orange Pi PC integration tests
+""""""""""""""""""""""""""""""
-The Orange Pi PC machine has several acceptance tests included.
+The Orange Pi PC machine has several integration tests included.
To run the whole set of tests, build QEMU from source and simply
provide the following command:
.. code-block:: bash
$ AVOCADO_ALLOW_LARGE_STORAGE=yes avocado --show=app,console run \
- -t machine:orangepi-pc tests/acceptance/boot_linux_console.py
+ -t machine:orangepi-pc tests/avocado/boot_linux_console.py
diff --git a/docs/system/arm/palm.rst b/docs/system/arm/palm.rst
index 47ff9b36d4..61bc8d34f4 100644
--- a/docs/system/arm/palm.rst
+++ b/docs/system/arm/palm.rst
@@ -14,7 +14,7 @@ following elements:
- On-chip Real Time Clock
- TI TSC2102i touchscreen controller / analog-digital converter /
- Audio CODEC, connected through MicroWire and |I2S| busses
+ Audio CODEC, connected through MicroWire and |I2S| buses
- GPIO-connected matrix keypad
diff --git a/docs/system/arm/raspi.rst b/docs/system/arm/raspi.rst
index 922fe375a6..fbec1da6a1 100644
--- a/docs/system/arm/raspi.rst
+++ b/docs/system/arm/raspi.rst
@@ -1,5 +1,5 @@
-Raspberry Pi boards (``raspi0``, ``raspi1ap``, ``raspi2b``, ``raspi3ap``, ``raspi3b``)
-======================================================================================
+Raspberry Pi boards (``raspi0``, ``raspi1ap``, ``raspi2b``, ``raspi3ap``, ``raspi3b``, ``raspi4b``)
+===================================================================================================
QEMU provides models of the following Raspberry Pi boards:
@@ -12,12 +12,13 @@ QEMU provides models of the following Raspberry Pi boards:
Cortex-A53 (4 cores), 512 MiB of RAM
``raspi3b``
Cortex-A53 (4 cores), 1 GiB of RAM
-
+``raspi4b``
+ Cortex-A72 (4 cores), 2 GiB of RAM
Implemented devices
-------------------
- * ARM1176JZF-S, Cortex-A7 or Cortex-A53 CPU
+ * ARM1176JZF-S, Cortex-A7, Cortex-A53 or Cortex-A72 CPU
* Interrupt controller
* DMA controller
* Clock and reset controller (CPRMAN)
@@ -33,11 +34,13 @@ Implemented devices
* USB2 host controller (DWC2 and MPHI)
* MailBox controller (MBOX)
* VideoCore firmware (property)
-
+ * Peripheral SPI controller (SPI)
+ * Broadcom Serial Controller (I2C)
Missing devices
---------------
- * Peripheral SPI controller (SPI)
* Analog to Digital Converter (ADC)
* Pulse Width Modulation (PWM)
+ * PCIe Root Port (raspi4b)
+ * GENET Ethernet Controller (raspi4b)
diff --git a/docs/system/arm/sbsa.rst b/docs/system/arm/sbsa.rst
index b499d7e927..2bf22a1d0b 100644
--- a/docs/system/arm/sbsa.rst
+++ b/docs/system/arm/sbsa.rst
@@ -1,17 +1,16 @@
Arm Server Base System Architecture Reference board (``sbsa-ref``)
==================================================================
-While the ``virt`` board is a generic board platform that doesn't match
-any real hardware the ``sbsa-ref`` board intends to look like real
-hardware. The `Server Base System Architecture
-<https://developer.arm.com/documentation/den0029/latest>`_ defines a
-minimum base line of hardware support and importantly how the firmware
-reports that to any operating system. It is a static system that
-reports a very minimal DT to the firmware for non-discoverable
-information about components affected by the qemu command line (i.e.
-cpus and memory). As a result it must have a firmware specifically
-built to expect a certain hardware layout (as you would in a real
-machine).
+The ``sbsa-ref`` board intends to look like real hardware (while the ``virt``
+board is a generic board platform that doesn't match any real hardware).
+
+The hardware part is defined by two specifications:
+
+ - `Base System Architecture <https://developer.arm.com/documentation/den0094/>`__ (BSA)
+ - `Server Base System Architecture <https://developer.arm.com/documentation/den0029/>`__ (SBSA)
+
+The `Arm Base Boot Requirements <https://developer.arm.com/documentation/den0044/>`__ (BBR)
+specification defines how the firmware reports that to any operating system.
It is intended to be a machine for developing firmware and testing
standards compliance with operating systems.
@@ -19,14 +18,73 @@ standards compliance with operating systems.
Supported devices
"""""""""""""""""
-The sbsa-ref board supports:
+The ``sbsa-ref`` board supports:
- A configurable number of AArch64 CPUs
- GIC version 3
- System bus AHCI controller
- - System bus EHCI controller
+ - System bus XHCI controller
- CDROM and hard disc on AHCI bus
- E1000E ethernet card on PCIe bus
- - VGA display adaptor on PCIe bus
+ - Bochs display adapter on PCIe bus
- A generic SBSA watchdog device
+
+Board to firmware interface
+"""""""""""""""""""""""""""
+
+``sbsa-ref`` is a static system that reports a very minimal devicetree to the
+firmware for non-discoverable information about system components. This
+includes both internal hardware and parts affected by the qemu command line
+(i.e. CPUs and memory). As a result it must have a firmware specifically built
+to expect a certain hardware layout (as you would in a real machine).
+
+Note
+''''
+
+QEMU provides the guest EL3 firmware with minimal information about the
+hardware platform using a minimalistic devicetree. This is not a Linux
+devicetree. It is not even a firmware devicetree.
+
+It is passed from QEMU to describe details that are affected by the QEMU
+command line and that a real hardware platform would discover at runtime
+through other mechanisms.
+
+Ultimately this devicetree may be replaced by IPC calls to an emulated SCP.
+
+DeviceTree information
+''''''''''''''''''''''
+
+The devicetree reports:
+
+ - CPUs
+ - memory
+ - platform version
+ - GIC addresses
+ - NUMA node id for CPUs and memory
+
+Platform version
+''''''''''''''''
+
+The platform version is only for informing platform firmware about
+what kind of ``sbsa-ref`` board it is running on. It is neither
+a QEMU versioned machine type nor a reflection of the level of the
+SBSA/SystemReady SR support provided.
+
+The ``machine-version-major`` value is updated when changes breaking
+firmware compatibility are introduced. The ``machine-version-minor`` value
+is updated when features are added that don't break firmware compatibility.
+
+Platform version changes:
+
+0.0
+ Devicetree holds information about CPUs, memory and platform version.
+
+0.1
+ GIC information is present in devicetree.
+
+0.2
+ GIC ITS information is present in devicetree.
+
+0.3
+ The USB controller is an XHCI device, not EHCI.
diff --git a/docs/system/arm/stm32.rst b/docs/system/arm/stm32.rst
index 508b92cf86..3b640f3ee0 100644
--- a/docs/system/arm/stm32.rst
+++ b/docs/system/arm/stm32.rst
@@ -16,10 +16,13 @@ based on this chip :
- ``netduino2`` Netduino 2 board with STM32F205RFT6 microcontroller
-The STM32F4 series is based on ARM Cortex-M4F core. This series is pin-to-pin
-compatible with STM32F2 series. The following machines are based on this chip :
+The STM32F4 series and the STM32L4 ultra-low-power series are both based on
+the ARM Cortex-M4F core. The STM32F4 series is pin-to-pin compatible with the
+STM32F2 series. The following machines are based on this ARM Cortex-M4F core :
- ``netduinoplus2`` Netduino Plus 2 board with STM32F405RGT6 microcontroller
+- ``olimex-stm32-h405`` Olimex STM32 H405 board with STM32F405RGT6 microcontroller
+- ``b-l475e-iot01a`` :doc:`B-L475E-IOT01A IoT Node </system/arm/b-l475e-iot01a>` board with STM32L475VG microcontroller
There are many other STM32 series that are currently not supported by QEMU.
diff --git a/docs/system/arm/vexpress.rst b/docs/system/arm/vexpress.rst
index 3e3839e923..38f29c73e7 100644
--- a/docs/system/arm/vexpress.rst
+++ b/docs/system/arm/vexpress.rst
@@ -58,6 +58,9 @@ Other differences between the hardware and the QEMU model:
``vexpress-a15``, and have IRQs from 40 upwards. If a dtb is
provided on the command line then QEMU will edit it to include
suitable entries describing these transports for the guest.
+- QEMU does not currently support either dynamic or static remapping
+ of the area of memory at address 0: it is always mapped to alias
+ the first flash bank
Booting a Linux kernel
----------------------
diff --git a/docs/system/arm/virt.rst b/docs/system/arm/virt.rst
index 850787495b..26fcba00b7 100644
--- a/docs/system/arm/virt.rst
+++ b/docs/system/arm/virt.rst
@@ -52,16 +52,36 @@ Supported guest CPU types:
- ``cortex-a7`` (32-bit)
- ``cortex-a15`` (32-bit; the default)
+- ``cortex-a35`` (64-bit)
- ``cortex-a53`` (64-bit)
+- ``cortex-a55`` (64-bit)
- ``cortex-a57`` (64-bit)
- ``cortex-a72`` (64-bit)
+- ``cortex-a76`` (64-bit)
+- ``cortex-a710`` (64-bit)
- ``a64fx`` (64-bit)
- ``host`` (with KVM only)
+- ``neoverse-n1`` (64-bit)
+- ``neoverse-v1`` (64-bit)
+- ``neoverse-n2`` (64-bit)
- ``max`` (same as ``host`` for KVM; best possible emulation with TCG)
Note that the default is ``cortex-a15``, so for an AArch64 guest you must
specify a CPU type.
+Also, please note that passing ``max`` CPU (i.e. ``-cpu max``) won't
+enable all the CPU features for a given ``virt`` machine. Where a CPU
+architectural feature requires support in both the CPU itself and in the
+wider system (e.g. the MTE feature), it may not be enabled by default,
+but instead requires a machine option to enable it.
+
+For example, MTE support must be enabled with ``-machine virt,mte=on``,
+as well as by selecting an MTE-capable CPU (e.g., ``max``) with the
+``-cpu`` option.
+
+See the machine-specific options below, or check them for a given machine
+by passing the ``help`` suboption, like: ``-machine virt-9.0,help``.
+
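+As a minimal sketch, an emulated (TCG) guest with MTE enabled could be
+started along these lines; any further options the guest needs (memory,
+kernel, disks) are left out here:
+
+.. code-block:: bash
+
+  $ qemu-system-aarch64 -machine virt,mte=on -cpu max
+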
Graphics output is available, but unlike the x86 PC machine types
there is no default display device enabled: you should select one from
the Display devices section of "-device help". The recommended option
@@ -89,21 +109,47 @@ mte
highmem
Set ``on``/``off`` to enable/disable placing devices and RAM in physical
address space above 32 bits. The default is ``on`` for machine types
- later than ``virt-2.12``.
+ later than ``virt-2.12`` when the CPU supports an address space
+ bigger than 32 bits (i.e. 64-bit CPUs, and 32-bit CPUs with the
+ Large Physical Address Extension (LPAE) feature). If you want to
+ boot a 32-bit kernel which does not have ``CONFIG_LPAE`` enabled on
+ a CPU type which implements LPAE, you will need to manually set
+ this to ``off``; otherwise some devices, such as the PCI controller,
+ will not be accessible.
+
+compact-highmem
+ Set ``on``/``off`` to enable/disable the compact layout for high memory regions.
+ The default is ``on`` for machine types later than ``virt-7.2``.
+
+highmem-redists
+ Set ``on``/``off`` to enable/disable the high memory region for GICv3 or
+ GICv4 redistributor. The default is ``on``. Setting this to ``off`` will
+ limit the maximum number of CPUs when GICv3 or GICv4 is used.
+
+highmem-ecam
+ Set ``on``/``off`` to enable/disable the high memory region for PCI ECAM.
+ The default is ``on`` for machine types later than ``virt-3.0``.
+
+highmem-mmio
+ Set ``on``/``off`` to enable/disable the high memory region for PCI MMIO.
+ The default is ``on``.
gic-version
Specify the version of the Generic Interrupt Controller (GIC) to provide.
Valid values are:
``2``
- GICv2
+ GICv2. Note that this limits the number of CPUs to 8.
``3``
- GICv3
+ GICv3. This allows up to 512 CPUs.
+ ``4``
+ GICv4. Requires ``virtualization`` to be ``on``; allows up to 317 CPUs.
``host``
Use the same GIC version the host provides, when using KVM
``max``
Use the best GIC version possible (same as host when using KVM;
- currently same as ``3``` for TCG, but this may change in future)
+ with TCG this is currently ``3`` if ``virtualization`` is ``off`` and
+ ``4`` if ``virtualization`` is ``on``, but this may change in future)
its
Set ``on``/``off`` to enable/disable ITS instantiation. The default is ``on``
@@ -121,6 +167,19 @@ ras
Set ``on``/``off`` to enable/disable reporting host memory errors to a guest
using ACPI and guest external abort exceptions. The default is off.
+dtb-randomness
+ Set ``on``/``off`` to pass random seeds via the guest DTB
+ rng-seed and kaslr-seed nodes (in both "/chosen" and
+ "/secure-chosen") to use for features like the random number
+ generator and address space randomisation. The default is
+ ``on``. You will want to disable it if your trusted boot chain
+ will verify the DTB it is passed, since this option causes the
+ DTB to be non-deterministic. It would be the responsibility of
+ the firmware to come up with a seed and pass it on if it wants to.
+
+dtb-kaslr-seed
+ A deprecated synonym for dtb-randomness.
+
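+As a sketch of how the options above are passed on the command line
+(the particular CPU types and values are chosen only for illustration):
+
+.. code-block:: bash
+
+  # 32-bit guest kernel built without CONFIG_LPAE on an LPAE-capable CPU
+  $ qemu-system-arm -machine virt,highmem=off -cpu cortex-a15
+
+  # GICv4 requires the virtualization option to be on
+  $ qemu-system-aarch64 -machine virt,virtualization=on,gic-version=4 -cpu max
+
+  # Keep the DTB deterministic for a boot chain that verifies it
+  $ qemu-system-aarch64 -machine virt,dtb-randomness=off -cpu max
+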
Linux guest kernel configuration
""""""""""""""""""""""""""""""""
diff --git a/docs/system/arm/xenpvh.rst b/docs/system/arm/xenpvh.rst
new file mode 100644
index 0000000000..430ac2c02e
--- /dev/null
+++ b/docs/system/arm/xenpvh.rst
@@ -0,0 +1,39 @@
+Xen Device Emulation Backend (``xenpvh``)
+=========================================
+
+This machine is a little unusual compared to others as QEMU just acts
+as an IOREQ server to register/connect with Xen Hypervisor. Control of
+the VMs themselves is left to the Xen tooling.
+
+When TPM is enabled, this machine also creates a tpm-tis-device at a
+user-specified TPM base address, adds a TPM emulator and connects to a
+swtpm application running on the host machine via a chardev socket. This
+enables xenpvh to support TPM functionality for a guest domain.
+
+More information about TPM use and installing the swtpm Linux application
+can be found in the :ref:`tpm-device` section.
+
+Example for starting swtpm on host machine:
+
+.. code-block:: console
+
+ mkdir /tmp/vtpm2
+ swtpm socket --tpmstate dir=/tmp/vtpm2 \
+ --ctrl type=unixio,path=/tmp/vtpm2/swtpm-sock &
+
+Sample QEMU xenpvh commands for running and connecting with Xen:
+
+.. code-block:: console
+
+ qemu-system-aarch64 -xen-domid 1 \
+ -chardev socket,id=libxl-cmd,path=qmp-libxl-1,server=on,wait=off \
+ -mon chardev=libxl-cmd,mode=control \
+ -chardev socket,id=libxenstat-cmd,path=qmp-libxenstat-1,server=on,wait=off \
+ -mon chardev=libxenstat-cmd,mode=control \
+ -xen-attach -name guest0 -vnc none -display none -nographic \
+ -machine xenpvh -m 1301 \
+ -chardev socket,id=chrtpm,path=/tmp/vtpm2/swtpm-sock \
+ -tpmdev emulator,id=tpm0,chardev=chrtpm -machine tpm-base-addr=0x0C000000
+
+In the above QEMU command, the last two lines connect xenpvh QEMU to swtpm
+via a chardev socket.
diff --git a/docs/system/arm/xlnx-versal-virt.rst b/docs/system/arm/xlnx-versal-virt.rst
index 27f73500d9..0bafc76469 100644
--- a/docs/system/arm/xlnx-versal-virt.rst
+++ b/docs/system/arm/xlnx-versal-virt.rst
@@ -32,6 +32,9 @@ Implemented devices:
- OCM (256KB of On Chip Memory)
- XRAM (4MB of on chip Accelerator RAM)
- DDR memory
+- BBRAM (36 bytes of Battery-backed RAM)
+- eFUSE (3072 bytes of one-time field-programmable bit array)
+- 2 CANFDs
QEMU does not yet model any other devices, including the PL and the AI Engine.
@@ -175,3 +178,80 @@ Run the following at the U-Boot prompt:
fdt set /chosen/dom0 reg <0x00000000 0x40000000 0x0 0x03100000>
booti 30000000 - 20000000
+BBRAM File Backend
+""""""""""""""""""
+BBRAM can have an optional file backend, which must be a seekable
+binary file with a size of 36 bytes or larger. A file with all
+binary 0s is a 'blank'.
+
+To add a file-backend for the BBRAM:
+
+.. code-block:: bash
+
+ -drive if=pflash,index=0,file=versal-bbram.bin,format=raw
+
+To use a different index value, N, from the default of 0, add:
+
+.. code-block:: bash
+
+ -global driver=xlnx.bbram-ctrl,property=drive-index,value=N
+
+eFUSE File Backend
+""""""""""""""""""
+eFUSE can have an optional file backend, which must be a seekable
+binary file with a size of 3072 bytes or larger. A file with all
+binary 0s is a 'blank'.
+
+To add a file-backend for the eFUSE:
+
+.. code-block:: bash
+
+ -drive if=pflash,index=1,file=versal-efuse.bin,format=raw
+
+To use a different index value, N, from the default of 1, add:
+
+.. code-block:: bash
+
+ -global xlnx-efuse.drive-index=N
+
+.. warning::
+ In an actual physical Versal device, BBRAM and eFUSE contain sensitive data.
+ The QEMU device models do **not** encrypt or obfuscate any data
+ when holding them in the models' memory or when writing them to their
+ file backends.
+
+ Thus, a file backend should be used with caution, and 'format=luks'
+ is highly recommended (albeit with usage complexity).
+
+ Better yet, do not use actual product data when running a guest image
+ on this Xilinx Versal Virt board.
+
+Using CANFDs for Versal Virt
+""""""""""""""""""""""""""""
+The Versal CANFD controller is developed based on SocketCAN and the QEMU CAN bus
+implementation. The bus connection and SocketCAN connection for each CAN module
+can be set through command-line options.
+
+To connect both CANFD0 and CANFD1 on the same bus:
+
+.. code-block:: bash
+
+ -object can-bus,id=canbus -machine canbus0=canbus -machine canbus1=canbus
+
+To connect CANFD0 and CANFD1 to separate buses:
+
+.. code-block:: bash
+
+ -object can-bus,id=canbus0 -object can-bus,id=canbus1 \
+ -machine canbus0=canbus0 -machine canbus1=canbus1
+
+The SocketCAN interface can connect to a physical or a virtual CAN interface on
+the host machine. Please check this document to learn about CAN interfaces on
+Linux: docs/system/devices/can.rst
+
+To connect CANFD0 and CANFD1 to host machine's CAN interface can0:
+
+.. code-block:: bash
+
+ -object can-bus,id=canbus -machine canbus0=canbus -machine canbus1=canbus \
+ -object can-host-socketcan,id=canhost0,if=can0,canbus=canbus
diff --git a/docs/system/arm/xscale.rst b/docs/system/arm/xscale.rst
index d2d5949e10..e239136c3c 100644
--- a/docs/system/arm/xscale.rst
+++ b/docs/system/arm/xscale.rst
@@ -32,4 +32,4 @@ The clamshell PDA models emulation includes the following peripherals:
- Three on-chip UARTs
-- WM8750 audio CODEC on |I2C| and |I2S| busses
+- WM8750 audio CODEC on |I2C| and |I2S| buses
diff --git a/docs/system/authz.rst b/docs/system/authz.rst
index 942af39602..55b7315e49 100644
--- a/docs/system/authz.rst
+++ b/docs/system/authz.rst
@@ -77,9 +77,7 @@ To create an instance of this driver via QMP:
"arguments": {
"qom-type": "authz-simple",
"id": "authz0",
- "props": {
- "identity": "fred"
- }
+ "identity": "fred"
}
}
@@ -110,15 +108,13 @@ To create an instance of this class via QMP:
"arguments": {
"qom-type": "authz-list",
"id": "authz0",
- "props": {
- "rules": [
- { "match": "fred", "policy": "allow", "format": "exact" },
- { "match": "bob", "policy": "allow", "format": "exact" },
- { "match": "danb", "policy": "deny", "format": "exact" },
- { "match": "dan*", "policy": "allow", "format": "glob" }
- ],
- "policy": "deny"
- }
+ "rules": [
+ { "match": "fred", "policy": "allow", "format": "exact" },
+ { "match": "bob", "policy": "allow", "format": "exact" },
+ { "match": "danb", "policy": "deny", "format": "exact" },
+ { "match": "dan*", "policy": "allow", "format": "glob" }
+ ],
+ "policy": "deny"
}
}
@@ -143,10 +139,8 @@ To create an instance of this class via QMP:
"arguments": {
"qom-type": "authz-list-file",
"id": "authz0",
- "props": {
- "filename": "/etc/qemu/myvm-vnc.acl",
- "refresh": true
- }
+ "filename": "/etc/qemu/myvm-vnc.acl",
+ "refresh": true
}
}
diff --git a/docs/confidential-guest-support.txt b/docs/system/confidential-guest-support.rst
index 71d07ba57a..0c490dbda2 100644
--- a/docs/confidential-guest-support.txt
+++ b/docs/system/confidential-guest-support.rst
@@ -19,10 +19,10 @@ Running a Confidential Guest
To run a confidential guest you need to add two command line parameters:
-1. Use "-object" to create a "confidential guest support" object. The
+1. Use ``-object`` to create a "confidential guest support" object. The
type and parameters will vary with the specific mechanism to be
used
-2. Set the "confidential-guest-support" machine parameter to the ID of
+2. Set the ``confidential-guest-support`` machine parameter to the ID of
the object from (1).
Example (for AMD SEV)::
@@ -37,13 +37,8 @@ Supported mechanisms
Currently supported confidential guest mechanisms are:
-AMD Secure Encrypted Virtualization (SEV)
- docs/amd-memory-encryption.txt
-
-POWER Protected Execution Facility (PEF)
- docs/papr-pef.txt
-
-s390x Protected Virtualization (PV)
- docs/system/s390x/protvirt.rst
+* AMD Secure Encrypted Virtualization (SEV) (see :doc:`i386/amd-memory-encryption`)
+* POWER Protected Execution Facility (PEF) (see :ref:`power-papr-protected-execution-facility-pef`)
+* s390x Protected Virtualization (PV) (see :doc:`s390x/protvirt`)
Other mechanisms may be supported in future.
diff --git a/docs/system/cpu-models-x86-abi.csv b/docs/system/cpu-models-x86-abi.csv
index f3f3b60be1..38b9bae310 100644
--- a/docs/system/cpu-models-x86-abi.csv
+++ b/docs/system/cpu-models-x86-abi.csv
@@ -8,27 +8,37 @@ Cascadelake-Server-v1,✅,✅,✅,✅
Cascadelake-Server-v2,✅,✅,✅,✅
Cascadelake-Server-v3,✅,✅,✅,✅
Cascadelake-Server-v4,✅,✅,✅,✅
+Cascadelake-Server-v5,✅,✅,✅,✅
Conroe-v1,✅,,,
Cooperlake-v1,✅,✅,✅,✅
+Cooperlake-v2,✅,✅,✅,✅
Denverton-v1,✅,✅,,
Denverton-v2,✅,✅,,
+Denverton-v3,✅,✅,,
Dhyana-v1,✅,✅,✅,
+Dhyana-v2,✅,✅,✅,
+EPYC-Genoa-v1,✅,✅,✅,✅
EPYC-Milan-v1,✅,✅,✅,
+EPYC-Milan-v2,✅,✅,✅,
EPYC-Rome-v1,✅,✅,✅,
EPYC-Rome-v2,✅,✅,✅,
+EPYC-Rome-v3,✅,✅,✅,
+EPYC-Rome-v4,✅,✅,✅,
EPYC-v1,✅,✅,✅,
EPYC-v2,✅,✅,✅,
EPYC-v3,✅,✅,✅,
+EPYC-v4,✅,✅,✅,
+GraniteRapids-v1,✅,✅,✅,✅
Haswell-v1,✅,✅,✅,
Haswell-v2,✅,✅,✅,
Haswell-v3,✅,✅,✅,
Haswell-v4,✅,✅,✅,
-Icelake-Client-v1,✅,✅,✅,
-Icelake-Client-v2,✅,✅,✅,
Icelake-Server-v1,✅,✅,✅,✅
Icelake-Server-v2,✅,✅,✅,✅
Icelake-Server-v3,✅,✅,✅,✅
Icelake-Server-v4,✅,✅,✅,✅
+Icelake-Server-v5,✅,✅,✅,✅
+Icelake-Server-v6,✅,✅,✅,✅
IvyBridge-v1,✅,✅,,
IvyBridge-v2,✅,✅,,
KnightsMill-v1,✅,✅,✅,
@@ -42,15 +52,21 @@ Opteron_G5-v1,✅,✅,,
Penryn-v1,✅,,,
SandyBridge-v1,✅,✅,,
SandyBridge-v2,✅,✅,,
+SapphireRapids-v1,✅,✅,✅,✅
+SapphireRapids-v2,✅,✅,✅,✅
Skylake-Client-v1,✅,✅,✅,
Skylake-Client-v2,✅,✅,✅,
Skylake-Client-v3,✅,✅,✅,
+Skylake-Client-v4,✅,✅,✅,
Skylake-Server-v1,✅,✅,✅,✅
Skylake-Server-v2,✅,✅,✅,✅
Skylake-Server-v3,✅,✅,✅,✅
Skylake-Server-v4,✅,✅,✅,✅
+Skylake-Server-v5,✅,✅,✅,✅
Snowridge-v1,✅,✅,,
Snowridge-v2,✅,✅,,
+Snowridge-v3,✅,✅,,
+Snowridge-v4,✅,✅,,
Westmere-v1,✅,✅,,
Westmere-v2,✅,✅,,
athlon-v1,,,,
diff --git a/docs/system/cpu-models-x86.rst.inc b/docs/system/cpu-models-x86.rst.inc
index 6e8be7d79b..ba27b5683f 100644
--- a/docs/system/cpu-models-x86.rst.inc
+++ b/docs/system/cpu-models-x86.rst.inc
@@ -49,7 +49,7 @@ future OS and toolchains are likely to target newer ABIs. The
table that follows illustrates which ABI compatibility levels
can be satisfied by the QEMU CPU models. Note that the table only
lists the long term stable CPU model versions (eg Haswell-v4).
-In addition to whats listed, there are also many CPU model
+In addition to what is listed, there are also many CPU model
aliases which resolve to a different CPU model version,
depending on the machine type is in use.
@@ -58,7 +58,7 @@ depending on the machine type is in use.
.. csv-table:: x86-64 ABI compatibility levels
:file: cpu-models-x86-abi.csv
:widths: 40,15,15,15,15
- :header-rows: 2
+ :header-rows: 1
Preferred CPU models for Intel x86 hosts
diff --git a/docs/system/device-emulation.rst b/docs/system/device-emulation.rst
index 7afcfd8064..f19777411c 100644
--- a/docs/system/device-emulation.rst
+++ b/docs/system/device-emulation.rst
@@ -82,9 +82,20 @@ Emulated Devices
.. toctree::
:maxdepth: 1
+ devices/can.rst
+ devices/ccid.rst
+ devices/cxl.rst
devices/ivshmem.rst
+ devices/keyboard.rst
devices/net.rst
devices/nvme.rst
devices/usb.rst
devices/vhost-user.rst
+ devices/virtio-gpu.rst
devices/virtio-pmem.rst
+ devices/virtio-snd.rst
+ devices/vhost-user-input.rst
+ devices/vhost-user-rng.rst
+ devices/canokey.rst
+ devices/usb-u2f.rst
+ devices/igb.rst
diff --git a/docs/system/device-url-syntax.rst.inc b/docs/system/device-url-syntax.rst.inc
index d15a021508..43b5c2596b 100644
--- a/docs/system/device-url-syntax.rst.inc
+++ b/docs/system/device-url-syntax.rst.inc
@@ -15,7 +15,7 @@ These are specified using a special URL syntax.
'iqn.2008-11.org.linux-kvm[:<name>]' but this can also be set from
the command line or a configuration file.
- Since version Qemu 2.4 it is possible to specify a iSCSI request
+ Since QEMU 2.4 it is possible to specify an iSCSI request
timeout to detect stalled requests and force a reestablishment of the
session. The timeout is specified in seconds. The default is 0 which
means no timeout. Libiscsi 1.15.0 or greater is required for this
@@ -87,8 +87,8 @@ These are specified using a special URL syntax.
``GlusterFS``
GlusterFS is a user space distributed file system. QEMU supports the
- use of GlusterFS volumes for hosting VM disk images using TCP, Unix
- Domain Sockets and RDMA transport protocols.
+ use of GlusterFS volumes for hosting VM disk images using TCP and Unix
+ Domain Sockets transport protocols.
Syntax for specifying a VM disk image on GlusterFS volume is
diff --git a/docs/can.txt b/docs/system/devices/can.rst
index 0d310237df..09121836fd 100644
--- a/docs/can.txt
+++ b/docs/system/devices/can.rst
@@ -1,13 +1,12 @@
-QEMU CAN bus emulation support
-==============================
-
+CAN Bus Emulation Support
+=========================
The CAN bus emulation provides mechanism to connect multiple
-emulated CAN controller chips together by one or multiple CAN busses
-(the controller device "canbus" parameter). The individual busses
+emulated CAN controller chips together by one or multiple CAN buses
+(the controller device "canbus" parameter). The individual buses
can be connected to host system CAN API (at this time only Linux
SocketCAN is supported).
-The concept of busses is generic and different CAN controllers
+The concept of buses is generic and different CAN controllers
can be implemented.
The initial submission implemented SJA1000 controller which
@@ -32,34 +31,39 @@ emulated environment for testing and RTEMS GSoC slot has been donated
to work on CAN hardware emulation on QEMU.
Examples how to use CAN emulation for SJA1000 based boards
-==========================================================
-
+----------------------------------------------------------
When QEMU with CAN PCI support is compiled then one of the next
CAN boards can be selected
- (1) CAN bus Kvaser PCI CAN-S (single SJA1000 channel) boad. QEMU startup options
+(1) CAN bus Kvaser PCI CAN-S (single SJA1000 channel) board. QEMU startup options::
+
-object can-bus,id=canbus0
-device kvaser_pci,canbus=canbus0
- Add "can-host-socketcan" object to connect device to host system CAN bus
+
+Add "can-host-socketcan" object to connect device to host system CAN bus::
+
-object can-host-socketcan,id=canhost0,if=can0,canbus=canbus0
- (2) CAN bus PCM-3680I PCI (dual SJA1000 channel) emulation
+(2) CAN bus PCM-3680I PCI (dual SJA1000 channel) emulation::
+
-object can-bus,id=canbus0
-device pcm3680_pci,canbus0=canbus0,canbus1=canbus0
- another example:
+Another example::
+
-object can-bus,id=canbus0
-object can-bus,id=canbus1
-device pcm3680_pci,canbus0=canbus0,canbus1=canbus1
- (3) CAN bus MIOe-3680 PCI (dual SJA1000 channel) emulation
- -device mioe3680_pci,canbus0=canbus0
+(3) CAN bus MIOe-3680 PCI (dual SJA1000 channel) emulation::
+ -device mioe3680_pci,canbus0=canbus0
The ''kvaser_pci'' board/device model is compatible with and has been tested with
-''kvaser_pci'' driver included in mainline Linux kernel.
+the ''kvaser_pci'' driver included in mainline Linux kernel.
The tested setup was Linux 4.9 kernel on the host and guest side.
-Example for qemu-system-x86_64:
+
+Example for qemu-system-x86_64::
qemu-system-x86_64 -accel kvm -kernel /boot/vmlinuz-4.9.0-4-amd64 \
-initrd ramdisk.cpio \
@@ -69,7 +73,7 @@ Example for qemu-system-x86_64:
-device kvaser_pci,canbus=canbus0 \
-nographic -append "console=ttyS0"
-Example for qemu-system-arm:
+Example for qemu-system-arm::
qemu-system-arm -cpu arm1176 -m 256 -M versatilepb \
-kernel kernel-qemu-arm1176-versatilepb \
@@ -84,24 +88,23 @@ Example for qemu-system-arm:
The CAN interface of the host system has to be configured for proper
bitrate and set up. Configuration is not propagated from emulated
devices through bus to the physical host device. Example configuration
-for 1 Mbit/s
+for 1 Mbit/s::
ip link set can0 type can bitrate 1000000
ip link set can0 up
Virtual (host local only) can interface can be used on the host
-side instead of physical interface
+side instead of physical interface::
ip link add dev can0 type vcan
The CAN interface on the host side can be used to analyze CAN
-traffic with "candump" command which is included in "can-utils".
+traffic with "candump" command which is included in "can-utils"::
candump can0
CTU CAN FD support examples
-===========================
-
+---------------------------
This open-source core provides CAN FD support. CAN FD drames are
delivered even to the host systems when SocketCAN interface is found
CAN FD capable.
@@ -113,7 +116,7 @@ on the board.
Example how to connect the canbus0-bus (virtual wire) to the host
Linux system (SocketCAN used) and to both CTU CAN FD cores emulated
on the corresponding PCI card expects that host system CAN bus
-is setup according to the previous SJA1000 section.
+is setup according to the previous SJA1000 section::
qemu-system-x86_64 -enable-kvm -kernel /boot/vmlinuz-4.19.52+ \
-initrd ramdisk.cpio \
@@ -125,7 +128,7 @@ is setup according to the previous SJA1000 section.
-device ctucan_pci,canbus0=canbus0-bus,canbus1=canbus0-bus \
-nographic
-Setup of CTU CAN FD controller in a guest Linux system
+Setup of CTU CAN FD controller in a guest Linux system::
insmod ctucanfd.ko || modprobe ctucanfd
insmod ctucanfd_pci.ko || modprobe ctucanfd_pci
@@ -150,49 +153,37 @@ Setup of CTU CAN FD controller in a guest Linux system
/bin/ip link set $ifc up
done
-The test can run for example
+The test can run for example::
candump can1
-in the guest system and next commands in the host system for basic CAN
+in the guest system and next commands in the host system for basic CAN::
cangen can0
-for CAN FD without bitrate switch
+for CAN FD without bitrate switch::
cangen can0 -f
-and with bitrate switch
+and with bitrate switch::
cangen can0 -b
-The test can be run viceversa, generate messages in the guest system and capture them
-in the host one and much more combinations.
+The test can also be run the other way around, generating messages in the
+guest system and capturing them in the host system. Other combinations are
+also possible.
Links to other resources
-========================
-
- (1) CAN related projects at Czech Technical University, Faculty of Electrical Engineering
- http://canbus.pages.fel.cvut.cz/
- (2) Repository with development can-pci branch at Czech Technical University
- https://gitlab.fel.cvut.cz/canbus/qemu-canbus
- (3) RTEMS page describing project
- https://devel.rtems.org/wiki/Developer/Simulators/QEMU/CANEmulation
- (4) RTLWS 2015 article about the project and its use with CANopen emulation
- http://cmp.felk.cvut.cz/~pisa/can/doc/rtlws-17-pisa-qemu-can.pdf
- (5) GNU/Linux, CAN and CANopen in Real-time Control Applications
- Slides from LinuxDays 2017 (include updated RTLWS 2015 content)
- https://www.linuxdays.cz/2017/video/Pavel_Pisa-CAN_canopen.pdf
- (6) Linux SocketCAN utilities
- https://github.com/linux-can/can-utils/
- (7) CTU CAN FD project including core VHDL design, Linux driver,
- test utilities etc.
- https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core
- (8) CTU CAN FD Core Datasheet Documentation
- http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/Progdokum.pdf
- (9) CTU CAN FD Core System Architecture Documentation
- http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/ctu_can_fd_architecture.pdf
- (10) CTU CAN FD Driver Documentation
- http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/driver_doc/ctucanfd-driver.html
- (11) Integration with PCIe interfacing for Intel/Altera Cyclone IV based board
- https://gitlab.fel.cvut.cz/canbus/pcie-ctu_can_fd
+------------------------
+
+ (1) `CAN related projects at Czech Technical University, Faculty of Electrical Engineering <http://canbus.pages.fel.cvut.cz>`_
+ (2) `Repository with development can-pci branch at Czech Technical University <https://gitlab.fel.cvut.cz/canbus/qemu-canbus>`_
+ (3) `RTEMS page describing project <https://devel.rtems.org/wiki/Developer/Simulators/QEMU/CANEmulation>`_
+ (4) `RTLWS 2015 article about the project and its use with CANopen emulation <http://cmp.felk.cvut.cz/~pisa/can/doc/rtlws-17-pisa-qemu-can.pdf>`_
+ (5) `GNU/Linux, CAN and CANopen in Real-time Control Applications Slides from LinuxDays 2017 (include updated RTLWS 2015 content) <https://www.linuxdays.cz/2017/video/Pavel_Pisa-CAN_canopen.pdf>`_
+ (6) `Linux SocketCAN utilities <https://github.com/linux-can/can-utils>`_
+ (7) `CTU CAN FD project including core VHDL design, Linux driver, test utilities etc. <https://gitlab.fel.cvut.cz/canbus/ctucanfd_ip_core>`_
+ (8) `CTU CAN FD Core Datasheet Documentation <http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/doc/Datasheet.pdf>`_
+ (9) `CTU CAN FD Core System Architecture Documentation <http://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/doc/System_Architecture.pdf>`_
+ (10) `CTU CAN FD Driver Documentation <https://canbus.pages.fel.cvut.cz/ctucanfd_ip_core/doc/linux_driver/build/ctucanfd-driver.html>`_
+ (11) `Integration with PCIe interfacing for Intel/Altera Cyclone IV based board <https://gitlab.fel.cvut.cz/canbus/pcie-ctu_can_fd>`_
diff --git a/docs/system/devices/canokey.rst b/docs/system/devices/canokey.rst
new file mode 100644
index 0000000000..7f3664963f
--- /dev/null
+++ b/docs/system/devices/canokey.rst
@@ -0,0 +1,158 @@
+.. _canokey:
+
+CanoKey QEMU
+------------
+
+CanoKey [1]_ is an open-source secure key with support for
+
+* U2F / FIDO2 with Ed25519 and HMAC-secret
+* OpenPGP Card V3.4 with RSA4096, Ed25519 and more [2]_
+* PIV (NIST SP 800-73-4)
+* HOTP / TOTP
+* NDEF
+
+All these platform-independent features are in canokey-core [3]_.
+
+For different platforms, CanoKey has different implementations,
+including both hardware implementations and virtual cards:
+
+* CanoKey STM32 [4]_
+* CanoKey Pigeon [5]_
+* (virt-card) CanoKey USB/IP
+* (virt-card) CanoKey FunctionFS
+
+In QEMU, yet another CanoKey virt-card is implemented.
+CanoKey QEMU exposes itself as a USB device to the guest OS.
+
+With the same software configuration as a hardware key,
+the guest OS can use all the functionalities of a secure key as if
+there was actually a hardware key plugged in.
+
+CanoKey QEMU is convenient for debugging:
+
+* libcanokey-qemu supports debugging output, so developers can
+ inspect what happens inside a secure key
+* CanoKey QEMU supports trace events, so events in the device can be traced
+* QEMU USB stack supports pcap, so USB packets between the guest
+ and key can be captured and analysed
+
+Then for developers:
+
+* For developers on software with secure key support (e.g. FIDO2, OpenPGP),
+ they can see what happens inside the secure key
+* For secure key developers, USB packets between guest OS and CanoKey
+ can be easily captured and analysed
+
+Also, since this is a virtual card, it can be easily used in CI for testing
+code that works with secure keys.
+
+Building
+========
+
+libcanokey-qemu is required to use CanoKey QEMU.
+
+.. code-block:: shell
+
+ git clone https://github.com/canokeys/canokey-qemu
+ mkdir canokey-qemu/build
+ pushd canokey-qemu/build
+
+If you want to install libcanokey-qemu in a different place,
+add ``-DCMAKE_INSTALL_PREFIX=/path/to/your/place`` to cmake below.
+
+.. code-block:: shell
+
+ cmake ..
+ make
+ make install # may need sudo
+ popd
+
+Then configure and build QEMU:
+
+.. code-block:: shell
+
+ # depending on your env, lib/pkgconfig can be lib64/pkgconfig
+ export PKG_CONFIG_PATH=/path/to/your/place/lib/pkgconfig:$PKG_CONFIG_PATH
+ ./configure --enable-canokey && make
+
+Using CanoKey QEMU
+==================
+
+CanoKey QEMU stores all its data in a file on the host, specified by the
+``file`` argument when invoking QEMU.
+
+.. parsed-literal::
+
+ |qemu_system| -usb -device canokey,file=$HOME/.canokey-file
+
+Note: you should keep this file safe, as it may contain your private key!
+
+The first time the file is used, it is created and initialized by CanoKey;
+afterwards CanoKey QEMU just reads this file.
+
+After the guest OS boots, you can check that there is a USB device.
+
+For example, if the guest OS is a Linux machine, you may invoke ``lsusb``
+and find CanoKey QEMU there:
+
+.. code-block:: shell
+
+ $ lsusb
+ Bus 001 Device 002: ID 20a0:42d4 Clay Logic CanoKey QEMU
+
+You may set up the key as guided in [6]_. The console for the key is at [7]_.
+
+Debugging
+=========
+
+CanoKey QEMU consists of two parts, ``libcanokey-qemu.so`` and ``canokey.c``,
+the latter of which resides in QEMU. The former provides core functionality
+of a secure key while the latter provides platform-dependent functions:
+USB packet handling.
+
+If you want to trace what happens inside the secure key, when compiling
+libcanokey-qemu, you should add ``-DQEMU_DEBUG_OUTPUT=ON`` in cmake command
+line:
+
+.. code-block:: shell
+
+ cmake .. -DQEMU_DEBUG_OUTPUT=ON
+
+If you want to trace events that happen in canokey.c, use
+
+.. parsed-literal::
+
+ |qemu_system| --trace "canokey_*" \\
+ -usb -device canokey,file=$HOME/.canokey-file
+
+If you want to capture USB packets between the guest and the host, you can:
+
+.. parsed-literal::
+
+ |qemu_system| -usb -device canokey,file=$HOME/.canokey-file,pcap=key.pcap
+
+Limitations
+===========
+
+Currently libcanokey-qemu.so has dozens of global variables as it was originally
+designed for embedded systems. Thus one QEMU instance cannot have
+multiple CanoKey QEMU devices running; that is, the following does not work:
+
+.. parsed-literal::
+
+ |qemu_system| -usb -device canokey,file=$HOME/.canokey-file \\
+ -device canokey,file=$HOME/.canokey-file2
+
+Also, there is no lock on canokey-file, thus two CanoKey QEMU instances
+cannot read one canokey-file at the same time.
+
+References
+==========
+
+.. [1] `<https://canokeys.org>`_
+.. [2] `<https://docs.canokeys.org/userguide/openpgp/#supported-algorithm>`_
+.. [3] `<https://github.com/canokeys/canokey-core>`_
+.. [4] `<https://github.com/canokeys/canokey-stm32>`_
+.. [5] `<https://github.com/canokeys/canokey-pigeon>`_
+.. [6] `<https://docs.canokeys.org/>`_
+.. [7] `<https://console.canokeys.org/>`_
diff --git a/docs/system/devices/ccid.rst b/docs/system/devices/ccid.rst
new file mode 100644
index 0000000000..3b8c2ab46a
--- /dev/null
+++ b/docs/system/devices/ccid.rst
@@ -0,0 +1,171 @@
+Chip Card Interface Device (CCID)
+=================================
+
+USB CCID device
+---------------
+The USB CCID device is a USB device implementing the CCID specification, which
+lets one connect smart card readers that implement the same spec. For more
+information see the specification::
+
+ Universal Serial Bus
+ Device Class: Smart Card
+ CCID
+ Specification for
+ Integrated Circuit(s) Cards Interface Devices
+ Revision 1.1
+ April 22rd, 2005
+
+Smartcards are used for authentication, single sign on, decryption in
+public/private schemes and digital signatures. A smartcard reader on the client
+cannot be used on a guest with simple usb passthrough since it will then not be
+available on the client, possibly locking the computer when it is "removed". On
+the other hand this device can let you use the smartcard on both the client and
+the guest machine. It is also possible to have a completely virtual smart card
+reader and smart card (i.e. not backed by a physical device) using this device.
+
+Building
+--------
+The cryptographic functions and access to the physical card are done via the
+libcacard library, whose development package must be installed prior to
+building QEMU:
+
+On Red Hat/Fedora::
+
+ yum install libcacard-devel
+
+On Ubuntu::
+
+ apt-get install libcacard-dev
+
+Configuring and building::
+
+ ./configure --enable-smartcard && make
+
+Using ccid-card-emulated with hardware
+--------------------------------------
+Assuming you have a working smartcard on the host with the current
+user, using libcacard, QEMU acts as another client using ccid-card-emulated::
+
+ qemu -usb -device usb-ccid -device ccid-card-emulated
+
+Using ccid-card-emulated with certificates stored in files
+----------------------------------------------------------
+You must create the CA and card certificates. This is a one time process.
+We use NSS certificates::
+
+ mkdir fake-smartcard
+ cd fake-smartcard
+ certutil -N -d sql:$PWD
+ certutil -S -d sql:$PWD -s "CN=Fake Smart Card CA" -x -t TC,TC,TC -n fake-smartcard-ca
+ certutil -S -d sql:$PWD -t ,, -s "CN=John Doe" -n id-cert -c fake-smartcard-ca
+ certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (signing)" --nsCertType smime -n signing-cert -c fake-smartcard-ca
+ certutil -S -d sql:$PWD -t ,, -s "CN=John Doe (encryption)" --nsCertType sslClient -n encryption-cert -c fake-smartcard-ca
+
+Note: you must have exactly three certificates.
+
+You can use the emulated card type with the certificates backend::
+
+ qemu -usb -device usb-ccid -device ccid-card-emulated,backend=certificates,db=sql:$PWD,cert1=id-cert,cert2=signing-cert,cert3=encryption-cert
+
+To use the certificates in the guest, export the CA certificate::
+
+ certutil -L -r -d sql:$PWD -o fake-smartcard-ca.cer -n fake-smartcard-ca
+
+and import it in the guest::
+
+ certutil -A -d /etc/pki/nssdb -i fake-smartcard-ca.cer -t TC,TC,TC -n fake-smartcard-ca
+
+In a Linux guest you can then use the CoolKey PKCS #11 module to access
+the card::
+
+ certutil -d /etc/pki/nssdb -L -h all
+
+It will prompt you for the PIN (which is the password you assigned to the
+certificate database early on), and then show you all three certificates
+together with the manually imported CA cert::
+
+ Certificate Nickname Trust Attributes
+ fake-smartcard-ca CT,C,C
+ John Doe:CAC ID Certificate u,u,u
+ John Doe:CAC Email Signature Certificate u,u,u
+ John Doe:CAC Email Encryption Certificate u,u,u
+
+If this does not happen, CoolKey is not installed or not registered with
+NSS. Registration can be done from Firefox or the command line::
+
+ modutil -dbdir /etc/pki/nssdb -add "CAC Module" -libfile /usr/lib64/pkcs11/libcoolkeypk11.so
+ modutil -dbdir /etc/pki/nssdb -list
+
+Using ccid-card-passthru with client side hardware
+--------------------------------------------------
+On the host specify the ccid-card-passthru device with a suitable chardev::
+
+ qemu -chardev socket,server=on,host=0.0.0.0,port=2001,id=ccid,wait=off \
+ -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid
+
+On the client run vscclient, built when you built QEMU::
+
+ vscclient <qemu-host> 2001
+
+Using ccid-card-passthru with client side certificates
+------------------------------------------------------
+This case is not particularly useful, but you can use it to debug
+your setup.
+
+Follow the instructions above, but run QEMU as before and run vscclient
+from the "fake-smartcard" directory as follows::
+
+ qemu -chardev socket,server=on,host=0.0.0.0,port=2001,id=ccid,wait=off \
+ -usb -device usb-ccid -device ccid-card-passthru,chardev=ccid
+ vscclient -e "db=\"sql:$PWD\" use_hw=no soft=(,Test,CAC,,id-cert,signing-cert,encryption-cert)" <qemu-host> 2001
+
+
+Passthrough protocol scenario
+-----------------------------
+This is a typical interchange of messages when using the passthru card device.
+usb-ccid is a usb device. It defaults to an unattached usb device on startup.
+usb-ccid expects a chardev and expects the protocol defined in
+cac_card/vscard_common.h to be passed over that.
+The usb-ccid device can be in one of three modes:
+
+* detached
+* attached with no card
+* attached with card
+
+A typical interchange is (the arrow shows who started each exchange, it can be client
+originated or guest originated)::
+
+ client event | vscclient | passthru | usb-ccid | guest event
+ ------------------------------------------------------------------------------------------------
+ | VSC_Init | | |
+ | VSC_ReaderAdd | | attach |
+ | | | | sees new usb device.
+ card inserted -> | | | |
+ | VSC_ATR | insert | insert | see new card
+ | | | |
+ | VSC_APDU | VSC_APDU | | <- guest sends APDU
+ client <-> physical | | | |
+ card APDU exchange | | | |
+ client response -> | VSC_APDU | VSC_APDU | | receive APDU response
+ ...
+ [APDU<->APDU repeats several times]
+ ...
+ card removed -> | | | |
+ | VSC_CardRemove | remove | remove | card removed
+ ...
+ [(card insert, apdu's, card remove) repeat]
+ ...
+ kill/quit | | | |
+ vscclient | | | |
+ | VSC_ReaderRemove | | detach |
+ | | | | usb device removed.
+
+libcacard
+---------
+Both ccid-card-emulated and vscclient use libcacard as the card emulator.
+libcacard implements a completely virtual CAC (DoD standard for smart
+cards) compliant card and uses NSS to retrieve certificates and do
+any encryption. The backend can then be a real reader and card, or
+certificates stored in files.
diff --git a/docs/system/devices/cxl.rst b/docs/system/devices/cxl.rst
new file mode 100644
index 0000000000..10a0e9bc9f
--- /dev/null
+++ b/docs/system/devices/cxl.rst
@@ -0,0 +1,414 @@
+Compute Express Link (CXL)
+==========================
+From the view of a single host, CXL is an interconnect standard that
+targets accelerators and memory devices attached to a CXL host.
+This description will focus on those aspects visible either to
+software running on a QEMU emulated host or to the internals of
+functional emulation. As such, it will skip over many of the
+electrical and protocol elements that would be more of interest
+for real hardware and will dominate more general introductions to CXL.
+It will also completely ignore the fabric management aspects of CXL
+by considering only a single host and a static configuration.
+
+CXL shares many concepts and much of the infrastructure of PCI Express,
+with CXL Host Bridges, which have CXL Root Ports which may be directly
+attached to CXL or PCI End Points. Alternatively there may be CXL Switches
+with CXL and PCI Endpoints attached below them. In many cases additional
+control and capabilities are exposed via PCI Express interfaces.
+This sharing of interfaces and hence emulation code is reflected
+in how the devices are emulated in QEMU. In most cases the various
+CXL elements are built upon equivalent PCIe devices.
+
+CXL devices support the following interfaces:
+
+* Most conventional PCIe interfaces
+
+ - Configuration space access
+ - BAR mapped memory accesses used for registers and mailboxes.
+ - MSI/MSI-X
+ - AER
+ - DOE mailboxes
+ - IDE
+ - Many other PCI Express defined interfaces.
+
+* Memory operations
+
+ - Equivalent of accessing DRAM / NVDIMMs. Any access / feature
+ supported by the host for normal memory should also work for
+ CXL attached memory devices.
+
+* Cache operations. These are mostly irrelevant to QEMU emulation as
+ QEMU is not emulating a coherency protocol. Any emulation related
+ to these will be device specific and is out of the scope of this
+ document.
+
+CXL 2.0 Device Types
+--------------------
+CXL 2.0 End Points are often categorized into three types.
+
+**Type 1:** These support coherent caching of host memory. An example might
+be a crypto accelerator. May also have device private memory accessible
+via means such as PCI memory reads and writes to BARs.
+
+**Type 2:** These support coherent caching of host memory and host
+managed device memory (HDM) for which the coherency protocol is managed
+by the host. This is a complex topic, so for more information on CXL
+coherency see the CXL 2.0 specification.
+
+**Type 3 Memory devices:** These devices act as a means of attaching
+additional memory (HDM) to a CXL host including both volatile and
+persistent memory. The CXL topology may support interleaving across a
+number of Type 3 memory devices using HDM Decoders in the host, host
+bridge, switch upstream port and endpoints.
+
+Scope of CXL emulation in QEMU
+------------------------------
+The focus of CXL emulation is CXL revision 2.0 and later. Earlier CXL
+revisions defined a smaller set of features, leaving much of the control
+interface as implementation defined or device specific, making generic
+emulation challenging with host specific firmware being responsible
+for setup and the Endpoints being presented to operating systems
+as Root Complex Integrated End Points. CXL rev 2.0 looks a lot
+more like PCI Express, with fully specified discoverability
+of the CXL topology.
+
+CXL System components
+----------------------
+A CXL system is made up of a Host with a number of 'standard components'
+the control and capabilities of which are discoverable by system software
+using means described in the CXL 2.0 specification.
+
+CXL Fixed Memory Windows (CFMW)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+A CFMW consists of a particular range of Host Physical Address space
+which is routed to particular CXL Host Bridges. At time of generic
+software initialization it will have a particular interleaving
+configuration and associated Quality of Service Throttling Group (QTG).
+This information is available to system software when making
+decisions about how to configure interleave across available CXL
+memory devices. It is provided as CFMW Structures (CFMWS) in
+the CXL Early Discovery Table, an ACPI table.
+
+Note: QTG 0 is the only one currently supported in QEMU.
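+
+As a sketch of how such a window might be declared on the QEMU command
+line, the fragment below defines one CFMW interleaved across two host
+bridges. The ids ``cxl.1`` and ``cxl.2``, the bus numbers, the 4G size
+and the 8k granularity are illustrative assumptions, not values required
+by this document::
+
+ -M q35,cxl=on \
+ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
+ -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \
+ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k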
+
+CXL Host Bridge (CXL HB)
+~~~~~~~~~~~~~~~~~~~~~~~~
+A CXL host bridge is similar to the PCIe equivalent, but with a
+specification defined register interface called CXL Host Bridge
+Component Registers (CHBCR). The location of this CHBCR MMIO
+space is described to system software via a CXL Host Bridge
+Structure (CHBS) in the CEDT ACPI table. The actual interfaces
+are identical to those used for other parts of the CXL hierarchy
+as CXL Component Registers in PCI BARs.
+
+Interfaces provided include:
+
+* Configuration of HDM Decoders to route CXL Memory accesses with
+ a particular Host Physical Address range to the target port
+ below which the CXL device servicing that address lies. This
+ may be a mapping to a single Root Port (RP) or across a set of
+ target RPs.
+
+CXL Root Ports (CXL RP)
+~~~~~~~~~~~~~~~~~~~~~~~
+A CXL Root Port serves the same purpose as a PCIe Root Port.
+There are a number of CXL specific Designated Vendor Specific
+Extended Capabilities (DVSEC) in PCIe Configuration Space
+and associated component register access via PCI BARs.
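+
+A minimal sketch of a host bridge with two root ports might look as
+follows. The ids (``cxl.1``, ``root_port0``, ``root_port1``) and the bus,
+chassis and slot numbers are example values chosen for illustration only::
+
+ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
+ -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
+ -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1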
+
+CXL Switch
+~~~~~~~~~~
+Here we consider a simple CXL switch with only a single
+virtual hierarchy. Whilst more complex devices exist, their
+visibility to a particular host is generally the same as for
+a simple switch design. Hosts often have no awareness
+of complex rerouting and device pooling; they simply see
+devices being hot added or hot removed.
+
+A CXL switch has a similar architecture to those in PCIe,
+with a single upstream port, internal PCI bus and multiple
+downstream ports.
+
+Both the CXL upstream and downstream ports have CXL specific
+DVSECs in configuration space, and component registers in PCI
+BARs. The Upstream Port has the configuration interfaces for
+the HDM decoders which route incoming memory accesses to the
+appropriate downstream port.
+
+A CXL switch is created in a similar fashion to PCI switches
+by creating an upstream port (cxl-upstream) and a number of
+downstream ports on the internal switch bus (cxl-downstream).
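+
+For instance, assuming a root port with the hypothetical id ``root_port0``
+already exists, a switch with two downstream ports could be sketched as
+(ids, chassis and slot numbers are illustrative)::
+
+ -device cxl-upstream,bus=root_port0,id=us0 \
+ -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
+ -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5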
+
+CXL Memory Devices - Type 3
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+CXL type 3 devices use a PCI class code and are intended to be supported
+by a generic operating system driver. They have HDM decoders
+though in these EP devices, the decoder is responsible not for
+routing but for translation of the incoming host physical address (HPA)
+into a Device Physical Address (DPA).
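+
+As an illustration, a persistent-memory Type 3 device below a root port
+might be created along these lines. The backing file paths, sizes and
+ids are assumptions made for the example, and ``root_port0`` is a
+hypothetical root port defined elsewhere on the command line::
+
+ -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa.raw,size=256M \
+ -device cxl-type3,bus=root_port0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0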
+
+CXL Memory Interleave
+---------------------
+To understand the interaction of different CXL hardware components which
+are emulated in QEMU, let us consider a memory read in a fully configured
+CXL topology. Note that system software is responsible for configuration
+of all components with the exception of the CFMWs. System software is
+responsible for allocating appropriate ranges from within the CFMWs
+and exposing those via normal memory configurations as would be done
+for system RAM.
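+
+As a rough illustration of the interleave arithmetic (a simplified sketch
+assuming plain modulo interleave with a power-of-two number of targets;
+the 8k granularity and 2 targets are example values only), the target
+selected at a decoder level follows from the address offset within the
+decoded range::
+
+ target index = (offset / granularity) mod number of targets
+
+ offset 0x0000 - 0x1fff -> 0 mod 2 = target 0
+ offset 0x2000 - 0x3fff -> 1 mod 2 = target 1
+ offset 0x4000 - 0x5fff -> 2 mod 2 = target 0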
+
+Example system topology. x marks the match in each decoder level::
+
+ |<------------------SYSTEM PHYSICAL ADDRESS MAP (1)----------------->|
+ | __________ __________________________________ __________ |
+ | | | | | | | |
+ | | CFMW 0 | | CXL Fixed Memory Window 1 | | CFMW 2 | |
+ | | HB0 only | | Configured to interleave memory | | HB1 only | |
+ | | | | memory accesses across HB0/HB1 | | | |
+ | |__________| |_____x____________________________| |__________| |
+ | | | |
+ | | | |
+ | | | |
+ | Interleave Decoder | |
+ | Matches this HB | |
+ \_____________| |_____________/
+ __________|__________ _____|_______________
+ | | | |
+ (2) | CXL HB 0 | | CXL HB 1 |
+ | HB IntLv Decoders | | HB IntLv Decoders |
+ | PCI/CXL Root Bus 0c | | PCI/CXL Root Bus 0d |
+ | | | |
+ |___x_________________| |_____________________|
+ | | | |
+ | | | |
+ A HB 0 HDM Decoder | | |
+ matches this Port | | |
+ | | | |
+ ___________|___ __________|__ __|_________ ___|_________
+ (3)| Root Port 0 | | Root Port 1 | | Root Port 2| | Root Port 3 |
+ | Appears in | | Appears in | | Appears in | | Appear in |
+ | PCI topology | | PCI topology| | PCI topo | | PCI topo |
+ | as 0c:00.0 | | as 0c:01.0 | | as de:00.0 | | as de:01.0 |
+ |_______________| |_____________| |____________| |_____________|
+ | | | |
+ | | | |
+ _____|_________ ______|______ ______|_____ ______|_______
+ (4)| x | | | | | | |
+ | CXL Type3 0 | | CXL Type3 1 | | CXL type3 2| | CXL Type 3 3 |
+ | | | | | | | |
+ | PMEM0(Vol LSA)| | PMEM1 (...) | | PMEM2 (...)| | PMEM3 (...) |
+ | Decoder to go | | | | | | |
+ | from host PA | | PCI 0e:00.0 | | PCI df:00.0| | PCI e0:00.0 |
+ | to device PA | | | | | | |
+ | PCI as 0d:00.0| | | | | | |
+ |_______________| |_____________| |____________| |______________|
+
+Notes:
+
+(1) **3 CXL Fixed Memory Windows (CFMW)** corresponding to different
+ ranges of the system physical address map. Each CFMW has a
+ particular interleave setup across the CXL Host Bridges (HB).
+ CFMW0 provides uninterleaved access to HB0, CFMW2 provides
+ uninterleaved access to HB1. CFMW1 provides interleaved memory access
+ across HB0 and HB1.
+
+(2) **Two CXL Host Bridges**. Each of these has 2 CXL Root Ports and
+ programmable HDM decoders to route memory accesses either to
+ a single port or interleave them across multiple ports.
+ A complex configuration here might be to use the following HDM
+ decoders in HB0. HDM0 routes CFMW0 requests to RP0 and hence
+ part of CXL Type3 0. HDM1 routes CFMW0 requests from a
+ different region of the CFMW0 PA range to RP1 and hence part
+ of CXL Type 3 1. HDM2 routes yet another PA range from within
+ CFMW0 to be interleaved across RP0 and RP1, providing 2 way
+ interleave of part of the memory provided by CXL Type3 0 and
+ CXL Type 3 1. HDM3 routes those interleaved accesses from
+ CFMW1 that target HB0 to RP 0 and another part of the memory of
+ CXL Type 3 0 (as part of a 2 way interleave at the system level
+ across, for example, CXL Type3 0 and CXL Type3 2).
+ HDM4 is used to enable system wide 4 way interleave across all
+ the present CXL type3 devices, by interleaving those (interleaved)
+ requests that HB0 receives from CFMW1 across RP 0 and
+ RP 1 and hence to yet more regions of the memory of the
+ attached Type3 devices. Note this is a representative subset
+ of the full range of possible HDM decoder configurations in this
+ topology.
+
+(3) **Four CXL Root Ports.** In this case the CXL Type 3 devices are
+ directly attached to these ports.
+
+(4) **Four CXL Type3 memory expansion devices.** These will each have
+ HDM decoders, but in this case rather than performing interleave
+ they will take the Host Physical Addresses of accesses and map
+ them to their own local Device Physical Address Space (DPA).
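
The routing and translation steps described in notes (1) to (4) can be sketched with plain modulo arithmetic. This is a simplified model, not QEMU's decoder code: it assumes a single interleave level with a power-of-2 number of ways, and the window base address and all function names below are hypothetical.

```python
def route_interleaved(hpa, window_base, granularity, targets):
    """Pick the interleave target that a host physical address (HPA) lands on."""
    offset = hpa - window_base
    if offset < 0:
        raise ValueError("HPA is below the window base")
    # Consecutive granularity-sized chunks rotate through the targets.
    return targets[(offset // granularity) % len(targets)]


def hpa_to_dpa(hpa, window_base, granularity, ways):
    """Translate an HPA into a device-local offset (DPA) for one interleave level."""
    offset = hpa - window_base
    # Each device only stores every 'ways'-th chunk, so drop the chunks
    # that were routed to the other devices.
    chunk = offset // (granularity * ways)
    return chunk * granularity + offset % granularity


BASE = 0x1_0000_0000      # hypothetical CFMW base address
GRAN = 8 * 1024           # 8k interleave granularity
HBS = ["HB0", "HB1"]      # two host bridges, as in CFMW1 above

for n in range(4):
    hpa = BASE + n * GRAN
    print(hex(hpa), route_interleaved(hpa, BASE, GRAN, HBS),
          hex(hpa_to_dpa(hpa, BASE, GRAN, len(HBS))))
```

Real hardware applies the same idea at each level (CFMW, host bridge, switch), and the endpoint decoder removes all of the interleave bits at once.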
+
+Example topology involving a switch::
+
+ |<------------------SYSTEM PHYSICAL ADDRESS MAP (1)----------------->|
+ | __________ __________________________________ __________ |
+ | | | | | | | |
+ | | CFMW 0 | | CXL Fixed Memory Window 1 | | CFMW 2 | |
+ | | HB0 only | | Configured to interleave memory | | HB1 only | |
+ | | | | memory accesses across HB0/HB1 | | | |
+ | |____x_____| |__________________________________| |__________| |
+ | | | |
+ | | | |
+ | | |
+ Interleave Decoder | | |
+ Matches this HB | | |
+ \_____________| |_____________/
+ __________|__________ _____|_______________
+ | | | |
+ | CXL HB 0 | | CXL HB 1 |
+ | HB IntLv Decoders | | HB IntLv Decoders |
+ | PCI/CXL Root Bus 0c | | PCI/CXL Root Bus 0d |
+ | | | |
+ |___x_________________| |_____________________|
+ | | | |
+ |
+ A HB 0 HDM Decoder
+ matches this Port
+ ___________|___
+ | Root Port 0 |
+ | Appears in |
+ | PCI topology |
+ | as 0c:00.0 |
+ |___________x___|
+ |
+ |
+ \_____________________
+ |
+ |
+ ---------------------------------------------------
+ | Switch 0 USP as PCI 0d:00.0 |
+ | USP has HDM decoder which direct traffic to |
+ | appropriate downstream port |
+ | Switch BUS appears as 0e |
+ |x__________________________________________________|
+ | | | |
+ | | | |
+ _____|_________ ______|______ ______|_____ ______|_______
+ (4)| x | | | | | | |
+ | CXL Type3 0 | | CXL Type3 1 | | CXL type3 2| | CXL Type 3 3 |
+ | | | | | | | |
+ | PMEM0(Vol LSA)| | PMEM1 (...) | | PMEM2 (...)| | PMEM3 (...) |
+ | Decoder to go | | | | | | |
+ | from host PA | | PCI 10:00.0 | | PCI 11:00.0| | PCI 12:00.0 |
+ | to device PA | | | | | | |
+ | PCI as 0f:00.0| | | | | | |
+ |_______________| |_____________| |____________| |______________|
+
+Example command lines
+---------------------
+A very simple setup with just one directly attached CXL Type 3 Persistent Memory device::
+
+ qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
+ ...
+ -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
+ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
+ -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
+ -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
+ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
+
+A very simple setup with just one directly attached CXL Type 3 Volatile Memory device::
+
+ qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
+ ...
+ -object memory-backend-ram,id=vmem0,share=on,size=256M \
+ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
+ -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
+ -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0 \
+ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
+
+The same volatile setup may optionally include an LSA region::
+
+ qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
+ ...
+ -object memory-backend-ram,id=vmem0,share=on,size=256M \
+ -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa.raw,size=256M \
+ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
+ -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
+ -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,lsa=cxl-lsa0,id=cxl-vmem0 \
+ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G
+
+A setup suitable for 4 way interleave. Only one fixed window provided, to enable 2 way
+interleave across 2 CXL host bridges. Each host bridge has 2 CXL Root Ports, with
+the CXL Type3 device directly attached (no switches).::
+
+ qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
+ ...
+ -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest.raw,size=256M \
+ -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \
+ -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \
+ -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/cxltest4.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa4,share=on,mem-path=/tmp/lsa4.raw,size=256M \
+ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
+ -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.2 \
+ -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
+ -device cxl-type3,bus=root_port13,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem0 \
+ -device cxl-rp,port=1,bus=cxl.1,id=root_port14,chassis=0,slot=3 \
+ -device cxl-type3,bus=root_port14,persistent-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem1 \
+ -device cxl-rp,port=0,bus=cxl.2,id=root_port15,chassis=0,slot=5 \
+ -device cxl-type3,bus=root_port15,persistent-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem2 \
+ -device cxl-rp,port=1,bus=cxl.2,id=root_port16,chassis=0,slot=6 \
+ -device cxl-type3,bus=root_port16,persistent-memdev=cxl-mem4,lsa=cxl-lsa4,id=cxl-pmem3 \
+ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.targets.1=cxl.2,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k
+
+An example of 4 devices below a switch suitable for 1, 2 or 4 way interleave::
+
+ qemu-system-x86_64 -M q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
+ ...
+ -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
+ -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest1.raw,size=256M \
+ -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \
+ -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa1.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \
+ -object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \
+ -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
+ -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
+ -device cxl-rp,port=1,bus=cxl.1,id=root_port1,chassis=0,slot=1 \
+ -device cxl-upstream,bus=root_port0,id=us0 \
+ -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
+ -device cxl-type3,bus=swport0,persistent-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-pmem0 \
+ -device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
+ -device cxl-type3,bus=swport1,persistent-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-pmem1 \
+ -device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
+ -device cxl-type3,bus=swport2,persistent-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-pmem2 \
+ -device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
+ -device cxl-type3,bus=swport3,persistent-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-pmem3 \
+ -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
+
+Deprecations
+------------
+
+The Type 3 device [memdev] attribute has been deprecated in favor of the
+[persistent-memdev] attribute. [memdev] will default to a persistent memory
+device for backward compatibility and cannot be used in combination
+with [persistent-memdev].
+
+Kernel Configuration Options
+----------------------------
+
+In Linux 5.18 the following options are necessary to make use of
+OS management of CXL memory devices as described here.
+
+* CONFIG_CXL_BUS
+* CONFIG_CXL_PCI
+* CONFIG_CXL_ACPI
+* CONFIG_CXL_PMEM
+* CONFIG_CXL_MEM
+* CONFIG_CXL_PORT
+* CONFIG_CXL_REGION
+
+References
+----------
+
+ - Consortium website for specifications etc:
+ http://www.computeexpresslink.org
+ - Compute Express Link (CXL) Specification, Revision 3.1, August 2023
diff --git a/docs/system/devices/igb.rst b/docs/system/devices/igb.rst
new file mode 100644
index 0000000000..04e79dfe54
--- /dev/null
+++ b/docs/system/devices/igb.rst
@@ -0,0 +1,73 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+.. _igb:
+
+igb
+---
+
+igb is a family of Intel's gigabit ethernet controllers. In QEMU, 82576
+emulation is implemented in particular. Its datasheet is available at [1]_.
+
+This implementation is expected to be useful to test SR-IOV networking without
+requiring physical hardware.
+
+Limitations
+===========
+
+This igb implementation was tested with Linux Test Project [2]_ and Windows HLK
+[3]_ during the initial development. Later it was also tested with DPDK Test
+Suite [4]_. The command used when testing with LTP is:
+
+.. code-block:: shell
+
+ network.sh -6mta
+
+Be aware that this implementation lacks many functionalities available with the
+actual hardware, and you may experience various failures if you try to use it
+with anything other than DPDK, Linux, or Windows, or if you
+try functionalities not covered by the tests.
+
+Using igb
+=========
+
+Using igb should be nothing different from using another network device. See
+:ref:`Network_emulation` in general.
+
+However, you may also need to perform additional steps to activate the SR-IOV
+feature on your guest. For Linux, refer to [5]_.
+
+Developing igb
+==============
+
+igb is the successor of e1000e, and e1000e is the successor of e1000 in turn.
+As these devices are very similar, if you make a change for igb and the same
+change can be applied to e1000e and e1000, please do so.
+
+Please do not forget to run tests before submitting a change. As the tests
+included in QEMU are very minimal, also run an application that is likely to be
+affected by the change to confirm it works in an integrated system.
+
+Testing igb
+===========
+
+A qtest of the basic functionality is available. Run the following from the
+build directory:
+
+.. code-block:: shell
+
+ meson test qtest-x86_64/qos-test
+
+ethtool can test register accesses, interrupts, etc. It is automated as an
+Avocado test and can be run with the following command:
+
+.. code:: shell
+
+ make check-avocado AVOCADO_TESTS=tests/avocado/netdev-ethtool.py
+
+References
+==========
+
+.. [1] https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82576eb-gigabit-ethernet-controller-datasheet.pdf
+.. [2] https://github.com/linux-test-project/ltp
+.. [3] https://learn.microsoft.com/en-us/windows-hardware/test/hlk/
+.. [4] https://doc.dpdk.org/dts/gsg/
+.. [5] https://docs.kernel.org/PCI/pci-iov-howto.html
diff --git a/docs/system/devices/ivshmem.rst b/docs/system/devices/ivshmem.rst
index b03a48afa3..ce71e25663 100644
--- a/docs/system/devices/ivshmem.rst
+++ b/docs/system/devices/ivshmem.rst
@@ -1,5 +1,3 @@
-.. _pcsys_005fivshmem:
-
Inter-VM Shared Memory device
-----------------------------
@@ -35,7 +33,7 @@ syntax when using the shared memory server is:
When using the server, the guest will be assigned a VM ID (>=0) that
allows guests using the same server to communicate via interrupts.
Guests can read their VM ID from a device register (see
-ivshmem-spec.txt).
+:doc:`../../specs/ivshmem-spec`).
Migration with ivshmem
~~~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/system/devices/keyboard.rst b/docs/system/devices/keyboard.rst
new file mode 100644
index 0000000000..a8f9fbebae
--- /dev/null
+++ b/docs/system/devices/keyboard.rst
@@ -0,0 +1,129 @@
+.. _keyboard:
+
+Sparc32 keyboard
+----------------
+SUN Type 4, 5 and 5c keyboards have dip switches to choose the language layout
+of the keyboard. Solaris makes an ioctl to query the value of the dip switches
+and uses that value to select the keyboard layout. The SUN BIOS, like the one
+in the file ss5.bin, also uses this value to support at least some keyboard
+layouts. However, the OpenBIOS provided with QEMU is hardcoded to always use a
+US keyboard layout.
+
+With the escc.chnA-sunkbd-layout driver property it is possible to select
+the keyboard layout. Example:
+
+``-global escc.chnA-sunkbd-layout=de``
+
+Depending on the type of keyboard, there are 5 or 6 dip switches to
+select the keyboard layout, giving up to 64 different layouts. Not all
+combinations are supported by Solaris and even fewer by Sun OpenBoot BIOS.
+
+The dip switch settings can be given as a hexadecimal number, a decimal number
+or in some cases as a language string. Examples:
+
+``-global escc.chnA-sunkbd-layout=0x2b``
+
+``-global escc.chnA-sunkbd-layout=43``
+
+``-global escc.chnA-sunkbd-layout=sv``
+
+The above 3 examples all select a Swedish keyboard layout. Table 3-15 at
+https://docs.oracle.com/cd/E19683-01/806-6642/new-43/index.html explains which
+keytable file is used for different dip switch settings. The information
+in that table can be summarized in this table:
+
+.. list-table:: Language selection values for escc.chnA-sunkbd-layout
+ :widths: 10 10 10
+ :header-rows: 1
+
+ * - Hexadecimal value
+ - Decimal value
+ - Language code
+ * - 0x21
+ - 33
+ - en-us
+ * - 0x23
+ - 35
+ - fr
+ * - 0x24
+ - 36
+ - da
+ * - 0x25
+ - 37
+ - de
+ * - 0x26
+ - 38
+ - it
+ * - 0x27
+ - 39
+ - nl
+ * - 0x28
+ - 40
+ - no
+ * - 0x29
+ - 41
+ - pt
+ * - 0x2a
+ - 42
+ - es
+ * - 0x2b
+ - 43
+ - sv
+ * - 0x2c
+ - 44
+ - fr-ch
+ * - 0x2d
+ - 45
+ - de-ch
+ * - 0x2e
+ - 46
+ - en-gb
+ * - 0x2f
+ - 47
+ - ko
+ * - 0x30
+ - 48
+ - tw
+ * - 0x31
+ - 49
+ - ja
+ * - 0x32
+ - 50
+ - fr-ca
+ * - 0x33
+ - 51
+ - hu
+ * - 0x34
+ - 52
+ - pl
+ * - 0x35
+ - 53
+ - cz
+ * - 0x36
+ - 54
+ - ru
+ * - 0x37
+ - 55
+ - lv
+ * - 0x38
+ - 56
+ - tr
+ * - 0x39
+ - 57
+ - gr
+ * - 0x3a
+ - 58
+ - ar
+ * - 0x3b
+ - 59
+ - lt
+ * - 0x3c
+ - 60
+ - nl-be
+ * - 0x3c
+ - 60
+ - be
+
+Not all dip switch values have a corresponding language code, and both "be" and
+"nl-be" correspond to the same dip switch value. By default, if no value is
+given to escc.chnA-sunkbd-layout, 0x21 (en-us) will be used.
diff --git a/docs/system/devices/net.rst b/docs/system/devices/net.rst
index 4b2640c448..2ab516d4b0 100644
--- a/docs/system/devices/net.rst
+++ b/docs/system/devices/net.rst
@@ -1,4 +1,4 @@
-.. _pcsys_005fnetwork:
+.. _Network_Emulation:
Network emulation
-----------------
diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index bff72d1c24..d2b1ca9645 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -70,7 +70,7 @@ namespaces and additional features, the ``nvme-ns`` device must be used.
The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
-identifers are allocated automatically, starting from ``1``.
+identifiers are allocated automatically, starting from ``1``.
There are a number of parameters available:
@@ -81,6 +81,13 @@ There are a number of parameters available:
Set the UUID of the namespace. This will be reported as a "Namespace UUID"
descriptor in the Namespace Identification Descriptor List.
+``nguid``
+ Set the NGUID of the namespace. This will be reported as a "Namespace Globally
+ Unique Identifier" descriptor in the Namespace Identification Descriptor List.
+ It is specified as a string of hexadecimal digits containing exactly 16 bytes
+ or "auto" for a random value. An optional '-' separator can be used to group
+ bytes. If not specified, the NGUID will remain all zeros.
+
``eui64``
Set the EUI-64 of the namespace. This will be reported as a "IEEE Extended
Unique Identifier" descriptor in the Namespace Identification Descriptor List.
@@ -104,34 +111,38 @@ multipath I/O.
.. code-block:: console
-device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
- -device nvme,serial=a,subsys=nvme-subsys-0
- -device nvme,serial=b,subsys=nvme-subsys-0
+ -device nvme,serial=deadbeef,subsys=nvme-subsys-0
+ -device nvme,serial=deadbeef,subsys=nvme-subsys-0
This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:
-``shared`` (default: ``off``)
+``shared`` (default: ``on`` since 6.2)
Specifies that the namespace will be attached to all controllers in the
- subsystem. If set to ``off`` (the default), the namespace will remain a
- private namespace and may only be attached to a single controller at a time.
+ subsystem. If set to ``off``, the namespace will remain a private namespace
+ and may only be attached to a single controller at a time. Shared namespaces
+ are always automatically attached to all controllers (also when controllers
+ are hotplugged).
``detached`` (default: ``off``)
If set to ``on``, the namespace will be be available in the subsystem, but
- not attached to any controllers initially.
+ not attached to any controllers initially. A shared namespace with this set
+ to ``on`` will never be automatically attached to controllers.
Thus, adding
.. code-block:: console
-drive file=nvm-1.img,if=none,id=nvm-1
- -device nvme-ns,drive=nvm-1,nsid=1,shared=on
+ -device nvme-ns,drive=nvm-1,nsid=1
-drive file=nvm-2.img,if=none,id=nvm-2
- -device nvme-ns,drive=nvm-2,nsid=3,detached=on
+ -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on
-will cause NSID 1 will be a shared namespace (due to ``shared=on``) that is
-initially attached to both controllers. NSID 3 will be a private namespace
-(i.e. only attachable to a single controller at a time) and will not be
-attached to any controller initially (due to ``detached=on``).
+will cause NSID 1 to be a shared namespace that is initially attached to both
+controllers. NSID 3 will be a private namespace due to ``shared=off`` and only
+attachable to a single controller at a time. Additionally, it will not be
+attached to any controller initially (due to ``detached=on``) or to hotplugged
+controllers.
Optional Features
=================
@@ -208,6 +219,41 @@ The namespace may be configured with additional parameters
the minimum memory page size (CAP.MPSMIN). The default value (``0``)
has this property inherit the ``mdts`` value.
+Flexible Data Placement
+-----------------------
+
+The device may be configured to support TP4146 ("Flexible Data Placement") by
+configuring it (``fdp=on``) on the subsystem::
+
+ -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16
+
+The subsystem emulates a single Endurance Group, on which Flexible Data
+Placement will be supported. Also note that the device emulation deviates
+slightly from the specification, by always enabling the "FDP Mode" feature on
+the controller if the subsystem is configured for Flexible Data Placement.
+
+Enabling Flexible Data Placement on the subsystem enables the following
+parameters:
+
+``fdp.nrg`` (default: ``1``)
+ Set the number of Reclaim Groups.
+
+``fdp.nruh`` (default: ``0``)
+ Set the number of Reclaim Unit Handles. This is a mandatory parameter and
+ must be non-zero.
+
+``fdp.runs`` (default: ``96M``)
+ Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.
+
+Namespaces within this subsystem may request Reclaim Unit Handles::
+
+ -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST
+
+The ``RUHLIST`` is a semicolon separated list (e.g. ``0;1;2;3``) and may
+include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
+the controller will assign the controller-specified reclaim unit handle to
+placement handle identifier 0.
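
To make the ``RUHLIST`` syntax concrete, here is a minimal sketch that expands such a list into individual handle numbers. It is an illustration only and not QEMU's parser; the function name is hypothetical, and treating ranges as inclusive on both ends is an assumption based on the ``0;8-15`` example.

```python
def parse_ruh_list(text):
    """Expand a semicolon separated list such as '0;8-15' into handle numbers."""
    handles = []
    for item in text.split(";"):
        item = item.strip()
        if not item:
            raise ValueError("empty entry in reclaim unit handle list")
        first, sep, last = item.partition("-")
        low = int(first)
        # A plain number is treated as a range of length one.
        high = int(last) if sep else low
        if high < low:
            raise ValueError(f"descending range: {item!r}")
        # Assumption: ranges are inclusive on both ends.
        handles.extend(range(low, high + 1))
    return handles


print(parse_ruh_list("0;1;2;3"))
print(parse_ruh_list("0;8-15"))
```

Because ``;`` is a command separator in most shells, the value usually needs quoting when it is passed on a command line.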
+
Metadata
--------
@@ -232,6 +278,94 @@ The virtual namespace device supports DIF- and DIX-based protection information
``pil=UINT8`` (default: ``0``)
Controls the location of the protection information within the metadata. Set
- to ``1`` to transfer protection information as the first eight bytes of
- metadata. Otherwise, the protection information is transferred as the last
- eight bytes.
+ to ``1`` to transfer protection information as the first bytes of metadata.
+ Otherwise, the protection information is transferred as the last bytes of
+ metadata.
+
+``pif=UINT8`` (default: ``0``)
+ By default, the namespace device uses 16 bit guard protection information
+ format (``pif=0``). Set to ``2`` to enable 64 bit guard protection
+ information format. This requires at least 16 bytes of metadata. Note that
+ ``pif=1`` (32 bit guards) is currently not supported.
+
+Virtualization Enhancements and SR-IOV (Experimental Support)
+-------------------------------------------------------------
+
+The ``nvme`` device supports Single Root I/O Virtualization and Sharing
+along with Virtualization Enhancements. The controller has to be linked to
+an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.
+
+A number of parameters are present (**please note that they may be
+subject to change**):
+
+``sriov_max_vfs`` (default: ``0``)
+ Indicates the maximum number of PCIe virtual functions supported
+ by the controller. Specifying a non-zero value enables reporting of both
+ SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
+ by the NVMe device. Virtual function controllers will not report SR-IOV.
+
+``sriov_vq_flexible``
+ Indicates the total number of flexible queue resources assignable to all
+ the secondary controllers. Implicitly sets the number of primary
+ controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.
+
+``sriov_vi_flexible``
+ Indicates the total number of flexible interrupt resources assignable to
+ all the secondary controllers. Implicitly sets the number of primary
+ controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.
+
+``sriov_max_vi_per_vf`` (default: ``0``)
+ Indicates the maximum number of virtual interrupt resources assignable
+ to a secondary controller. The default ``0`` resolves to
+ ``(sriov_vi_flexible / sriov_max_vfs)``
+
+``sriov_max_vq_per_vf`` (default: ``0``)
+ Indicates the maximum number of virtual queue resources assignable to
+ a secondary controller. The default ``0`` resolves to
+ ``(sriov_vq_flexible / sriov_max_vfs)``
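
The arithmetic implied by these parameters can be summarised in a short sketch. This is not code from the device model: the function and key names are hypothetical, the ``max_ioqpairs`` and ``msix_qsize`` values in the example are illustrative, and integer division for the per-VF defaults is an assumption.

```python
def sriov_resource_split(max_ioqpairs, msix_qsize, sriov_max_vfs,
                         sriov_vq_flexible, sriov_vi_flexible,
                         sriov_max_vq_per_vf=0, sriov_max_vi_per_vf=0):
    """Derive private and per-VF resource limits from the SR-IOV parameters."""
    if sriov_max_vfs <= 0:
        raise ValueError("sriov_max_vfs must be non-zero to enable SR-IOV")
    return {
        # Whatever is not flexible stays private to the primary controller.
        "primary_private_vq": max_ioqpairs - sriov_vq_flexible,
        "primary_private_vi": msix_qsize - sriov_vi_flexible,
        # A per-VF limit of 0 resolves to an even share of the flexible pool
        # (assumption: the division rounds down).
        "max_vq_per_vf": sriov_max_vq_per_vf or sriov_vq_flexible // sriov_max_vfs,
        "max_vi_per_vf": sriov_max_vi_per_vf or sriov_vi_flexible // sriov_max_vfs,
    }


# Illustrative values: one VF, two flexible queues, one flexible interrupt.
print(sriov_resource_split(max_ioqpairs=64, msix_qsize=65, sriov_max_vfs=1,
                           sriov_vq_flexible=2, sriov_vi_flexible=1))
```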
+
+The simplest possible invocation enables the capability to set up one VF
+controller and assign an admin queue, an IO queue, and an MSI-X interrupt.
+
+.. code-block:: console
+
+ -device nvme-subsys,id=subsys0
+ -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,
+ sriov_vq_flexible=2,sriov_vi_flexible=1
+
+The minimum steps required to configure a functional NVMe secondary
+controller are:
+
+ * unbind flexible resources from the primary controller
+
+.. code-block:: console
+
+ nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
+ nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0
+
+ * perform a Function Level Reset on the primary controller to actually
+ release the resources
+
+.. code-block:: console
+
+ echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset
+
+ * enable VF
+
+.. code-block:: console
+
+ echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs
+
+ * assign the flexible resources to the VF and set it ONLINE
+
+.. code-block:: console
+
+ nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
+ nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
+ nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0
+
+ * bind the NVMe driver to the VF
+
+.. code-block:: console
+
+ echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
diff --git a/docs/system/devices/usb-u2f.rst b/docs/system/devices/usb-u2f.rst
new file mode 100644
index 0000000000..4f57d5c8c3
--- /dev/null
+++ b/docs/system/devices/usb-u2f.rst
@@ -0,0 +1,93 @@
+Universal Second Factor (U2F) USB Key Device
+============================================
+
+U2F is an open authentication standard that enables relying parties
+exposed to the internet to offer a strong second factor option for end
+user authentication.
+
+The second factor is provided by a device implementing the U2F
+protocol. In case of a USB U2F security key, it is a USB HID device
+that implements the U2F protocol.
+
+QEMU supports both pass-through of a host U2F key device to a VM,
+and software emulation of a U2F key.
+
+``u2f-passthru``
+----------------
+
+The ``u2f-passthru`` device allows you to connect a real hardware
+U2F key on your host to a guest VM. All requests made from the guest
+are passed through to the physical security key connected to the
+host machine and vice versa.
+
+In addition, the dedicated pass-through allows you to share a single
+U2F security key with several guest VMs, which is not possible with a
+simple host device assignment pass-through.
+
+You can specify the host U2F key to use with the ``hidraw``
+option, which takes the host path to a Linux ``/dev/hidrawN`` device:
+
+.. parsed-literal::
+ |qemu_system| -usb -device u2f-passthru,hidraw=/dev/hidraw0
+
+If you don't specify the device, the ``u2f-passthru`` device will
+autoscan to take the first U2F device it finds on the host (this
+requires a working libudev):
+
+.. parsed-literal::
+ |qemu_system| -usb -device u2f-passthru
+
+``u2f-emulated``
+----------------
+
+``u2f-emulated`` is a completely software emulated U2F device.
+It uses `libu2f-emu <https://github.com/MattGorko/libu2f-emu>`__
+for the U2F key emulation. libu2f-emu
+provides a complete implementation of the U2F protocol device part for
+all specified transports given by the FIDO Alliance.
+
+To work, an emulated U2F device must have four elements:
+
+ * ec x509 certificate
+ * ec private key
+ * counter (four-byte value)
+ * 48 bytes of entropy (random bits)
+
+To use this type of device, these four elements must be configured
+and passed to it one way or another.
+
+Assuming that you have a working libu2f-emu installed on the host,
+there are three possible ways to configure the ``u2f-emulated`` device:
+
+ * ephemeral
+ * setup directory
+ * manual
+
+Ephemeral is the simplest way to configure; it lets the device generate
+all the elements it needs, valid only for the lifetime of the device.
+It is the default if you do not pass any other options to the device.
+
+.. parsed-literal::
+ |qemu_system| -usb -device u2f-emulated
+
+You can pass the device the path of a setup directory on the host
+using the ``dir`` option; the directory must contain these four files:
+
+ * ``certificate.pem``: ec x509 certificate
+ * ``private-key.pem``: ec private key
+ * ``counter``: counter value
+ * ``entropy``: 48 bytes of entropy
+
+.. parsed-literal::
+ |qemu_system| -usb -device u2f-emulated,dir=$dir
+
+You can also manually pass the device the paths to each of these files,
+if you don't want them all to be in the same directory, using the options
+
+ * ``cert``
+ * ``priv``
+ * ``counter``
+ * ``entropy``
+
+.. parsed-literal::
+ |qemu_system| -usb -device u2f-emulated,cert=$DIR1/$FILE1,priv=$DIR2/$FILE2,counter=$DIR3/$FILE3,entropy=$DIR4/$FILE4
diff --git a/docs/system/devices/usb.rst b/docs/system/devices/usb.rst
index afb7d6c226..a6ca7b0c37 100644
--- a/docs/system/devices/usb.rst
+++ b/docs/system/devices/usb.rst
@@ -1,5 +1,3 @@
-.. _pcsys_005fusb:
-
USB emulation
-------------
@@ -178,8 +176,20 @@ option or the ``device_add`` monitor command. Available devices are:
host character device id.
``usb-braille,chardev=id``
- Braille device. This will use BrlAPI to display the braille output on
- a real or fake device referenced by id.
+ Braille device. This emulates a Baum Braille device USB port. id has to
+ specify a character device defined with ``-chardev …,id=id``. One will
+ normally use BrlAPI to display the braille output on a BRLTTY-supported
+ device with
+
+ .. parsed-literal::
+
+ |qemu_system| [...] -chardev braille,id=brl -device usb-braille,chardev=brl
+
+ or alternatively, use the following equivalent shortcut:
+
+ .. parsed-literal::
+
+ |qemu_system| [...] -usbdevice braille
``usb-net[,netdev=id]``
Network adapter that supports CDC ethernet and RNDIS protocols. id
@@ -197,7 +207,11 @@ option or the ``device_add`` monitor command. Available devices are:
USB audio device
``u2f-{emulated,passthru}``
- Universal Second Factor device
+ :doc:`usb-u2f`
+
+``canokey``
+ An Open-source Secure Key implementing FIDO2, OpenPGP, PIV and more.
+ For more information, see :ref:`canokey`.
Physical port addressing
^^^^^^^^^^^^^^^^^^^^^^^^
@@ -349,3 +363,44 @@ and also assign it to the correct USB bus in QEMU like this:
-device usb-ehci,id=ehci \\
-device usb-host,bus=usb-bus.0,hostbus=3,hostport=1 \\
-device usb-host,bus=ehci.0,hostbus=1,hostport=1
+
+``usb-host`` properties for reset behavior
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``guest-reset`` and ``guest-resets-all`` properties control
+whether the guest is allowed to reset the physical USB device on the
+host. There are three cases:
+
+``guest-reset=false``
+ The guest is not allowed to reset the (physical) usb device.
+
+``guest-reset=true,guest-resets-all=false``
+ The guest is allowed to reset the device when it is not yet
+ initialized (aka no usb bus address assigned). Usually this results
+ in one guest reset being allowed. This is the default behavior.
+
+``guest-reset=true,guest-resets-all=true``
+ The guest is allowed to reset the device as it pleases.
+
+These properties exist because of broken USB devices. In theory one
+should be able to reset (and re-initialize) USB devices at any time.
+In practice that may result in buggy USB device firmware crashing and
+the device not responding any more until you power-cycle (aka un-plug
+and re-plug) it.
+
+What works best pretty much depends on the behavior of the specific
+USB device at hand, so it's a trial-and-error game. If the default
+doesn't work, try another option and see whether the situation
+improves.
+
+Record USB transfers
+^^^^^^^^^^^^^^^^^^^^
+
+All usb devices have support for recording the usb traffic. This can
+be enabled using the ``pcap=<file>`` property, for example:
+
+``-device usb-mouse,pcap=mouse.pcap``
+
+The pcap files are compatible with the linux kernels usbmon. Many
+tools, including ``wireshark``, can decode and inspect these trace
+files.
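+
+To check that a capture was written at all, the pcap global header can be
+inspected without any extra tooling. A minimal sketch in Python, assuming the
+classic (microsecond) pcap file format; ``read_pcap_header`` is a hypothetical
+helper, and which link type QEMU writes is not stated here, so the value is
+only reported, not validated:
+
```python
import struct

PCAP_MAGIC = 0xA1B2C3D4  # classic pcap magic (microsecond timestamps)


def read_pcap_header(data):
    """Parse the 24-byte pcap global header.

    Returns (byte_order, linktype), where byte_order is "<" or ">".
    Hypothetical helper for illustration; nanosecond-resolution and
    pcapng files are deliberately not handled.
    """
    if len(data) < 24:
        raise ValueError("truncated pcap header")
    for order in ("<", ">"):
        # magic, version major/minor, thiszone, sigfigs, snaplen, linktype
        fields = struct.unpack(order + "IHHiIII", data[:24])
        if fields[0] == PCAP_MAGIC:
            return order, fields[6]
    raise ValueError("not a classic pcap file")
```
+
+For example, ``read_pcap_header(open("mouse.pcap", "rb").read(24))`` returns
+the byte order and link type of the file from the example above. In the
+tcpdump.org link-type registry, 220 is ``LINKTYPE_USB_LINUX_MMAPPED``, the
+type usually carried by usbmon captures.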
diff --git a/docs/system/devices/vhost-user-input.rst b/docs/system/devices/vhost-user-input.rst
new file mode 100644
index 0000000000..118eb78101
--- /dev/null
+++ b/docs/system/devices/vhost-user-input.rst
@@ -0,0 +1,45 @@
+.. _vhost_user_input:
+
+QEMU vhost-user-input - Input emulation
+=======================================
+
+This document describes the setup and usage of the Virtio input device.
+The Virtio input device is a paravirtualized device for input events.
+
+Description
+-----------
+
+The vhost-user-input device implementation was designed to work with a daemon
+that polls input devices and passes input events to the guest.
+
+QEMU provides a backend implementation in contrib/vhost-user-input.
+
+Linux kernel support
+--------------------
+
+Virtio input requires a guest Linux kernel built with the
+``CONFIG_VIRTIO_INPUT`` option.
+
+Examples
+--------
+
+The backend daemon should be started first:
+
+::
+
+ host# vhost-user-input --socket-path=/tmp/input.sock \
+ --evdev-path=/dev/input/event17
+
+The QEMU invocation needs to create a chardev socket to communicate with the
+backend daemon and access the VirtIO queues with the guest over the
+:ref:`shared memory <shared_memory_object>`.
+
+::
+
+ host# qemu-system \
+ -chardev socket,path=/tmp/input.sock,id=mouse0 \
+ -device vhost-user-input-pci,chardev=mouse0 \
+ -m 4096 \
+ -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on \
+ -numa node,memdev=mem \
+ ...
diff --git a/docs/system/devices/vhost-user-rng.rst b/docs/system/devices/vhost-user-rng.rst
new file mode 100644
index 0000000000..ead1405326
--- /dev/null
+++ b/docs/system/devices/vhost-user-rng.rst
@@ -0,0 +1,41 @@
+.. _vhost_user_rng:
+
+QEMU vhost-user-rng - RNG emulation
+===================================
+
+Background
+----------
+
+What follows builds on the material presented in vhost-user.rst - it should
+be reviewed before moving forward with the content in this file.
+
+Description
+-----------
+
+The vhost-user-rng device implementation was designed to work with a random
+number generator daemon such as the one found in the vhost-device crate of
+the rust-vmm project available on github [1].
+
+[1]. https://github.com/rust-vmm/vhost-device
+
+Examples
+--------
+
+The daemon should be started first:
+
+::
+
+ host# vhost-device-rng --socket-path=rng.sock -c 1 -m 512 -p 1000
+
+The QEMU invocation needs to create a chardev socket the device can
+use to communicate as well as share the guest's memory over a memfd.
+
+::
+
+ host# qemu-system \
+ -chardev socket,path=$(PATH)/rng.sock,id=rng0 \
+ -device vhost-user-rng-pci,chardev=rng0 \
+ -m 4096 \
+ -object memory-backend-file,id=mem,size=4G,mem-path=/dev/shm,share=on \
+ -numa node,memdev=mem \
+ ...
diff --git a/docs/system/devices/vhost-user.rst b/docs/system/devices/vhost-user.rst
index 86128114fa..9b2da106ce 100644
--- a/docs/system/devices/vhost-user.rst
+++ b/docs/system/devices/vhost-user.rst
@@ -8,13 +8,81 @@ outside of QEMU itself. To do this there are a number of things
required.
vhost-user device
-===================
+=================
These are simple stub devices that ensure the VirtIO device is visible
to the guest. The code is mostly boilerplate although each device has
a ``chardev`` option which specifies the ID of the ``--chardev``
device that connects via a socket to the vhost-user *daemon*.
+Each device will have a virtio-mmio and virtio-pci variant. See your
+platform details for what sort of virtio bus to use.
+
+.. list-table:: vhost-user devices
+ :widths: 20 20 60
+ :header-rows: 1
+
+ * - Device
+ - Type
+ - Notes
+ * - vhost-user-blk
+ - Block storage
+ - See contrib/vhost-user-blk
+ * - vhost-user-fs
+ - File based storage driver
+ - See https://gitlab.com/virtio-fs/virtiofsd
+ * - vhost-user-gpio
+ - Proxy gpio pins to host
+ - See https://github.com/rust-vmm/vhost-device
+ * - vhost-user-gpu
+ - GPU driver
+ - See contrib/vhost-user-gpu
+ * - vhost-user-i2c
+ - Proxy i2c devices to host
+ - See https://github.com/rust-vmm/vhost-device
+ * - vhost-user-input
+ - Generic input driver
+ - :ref:`vhost_user_input`
+ * - vhost-user-rng
+ - Entropy driver
+ - :ref:`vhost_user_rng`
+ * - vhost-user-scmi
+ - System Control and Management Interface
+ - See https://github.com/rust-vmm/vhost-device
+ * - vhost-user-snd
+ - Audio device
+ - See https://github.com/rust-vmm/vhost-device/staging
+ * - vhost-user-scsi
+ - SCSI based storage
+ - See contrib/vhost-user-scsi
+ * - vhost-user-vsock
+ - Socket based communication
+ - See https://github.com/rust-vmm/vhost-device
+
+The list of referenced *daemons* is not exhaustive; any conforming backend
+implementing the device and using the vhost-user protocol should work.
+
+vhost-user-device
+^^^^^^^^^^^^^^^^^
+
+The vhost-user-device is a generic development device intended for
+expert use while developing new backends. The user needs to specify
+all the required parameters including:
+
+ - Device ``virtio-id``
+ - The ``num_vqs`` it needs and their ``vq_size``
+ - The ``config_size`` if needed
+
+.. note::
+ To prevent user confusion you cannot currently instantiate
+ vhost-user-device without first patching out::
+
+ /* Reason: stop inexperienced users confusing themselves */
+ dc->user_creatable = false;
+
+ in the ``vhost-user-device.c`` and ``vhost-user-device-pci.c`` files and
+ rebuilding.
+
vhost-user daemon
=================
@@ -23,6 +91,8 @@ following the :ref:`vhost_user_proto`. There are a number of daemons
that can be built when enabled by the project although any daemon that
meets the specification for a given device can be used.
+.. _shared_memory_object:
+
Shared memory object
====================
@@ -38,13 +108,13 @@ system memory as defined by the ``-m`` argument.
Example
=======
-First start you daemon.
+First start your daemon.
.. parsed-literal::
$ virtio-foo --socket-path=/var/run/foo.sock $OTHER_ARGS
-The you start your QEMU instance specifying the device, chardev and
+Then you start your QEMU instance specifying the device, chardev and
memory objects.
.. parsed-literal::
diff --git a/docs/system/devices/virtio-gpu.rst b/docs/system/devices/virtio-gpu.rst
new file mode 100644
index 0000000000..cb73dd7998
--- /dev/null
+++ b/docs/system/devices/virtio-gpu.rst
@@ -0,0 +1,112 @@
+..
+ SPDX-License-Identifier: GPL-2.0-or-later
+
+virtio-gpu
+==========
+
+This document explains the setup and usage of the virtio-gpu device.
+The virtio-gpu device paravirtualizes the GPU and display controller.
+
+Linux kernel support
+--------------------
+
+virtio-gpu requires a guest Linux kernel built with the
+``CONFIG_DRM_VIRTIO_GPU`` option.
+
+QEMU virtio-gpu variants
+------------------------
+
+QEMU virtio-gpu device variants come in the following forms:
+
+ * ``virtio-vga[-BACKEND]``
+ * ``virtio-gpu[-BACKEND][-INTERFACE]``
+ * ``vhost-user-vga``
+ * ``vhost-user-gpu[-pci]``
+
+**Backends:** QEMU provides a 2D virtio-gpu backend, and two accelerated
+backends: virglrenderer ('gl' device label) and rutabaga_gfx ('rutabaga'
+device label). There is a vhost-user backend that runs the graphics stack
+in a separate process for improved isolation.
+
+**Interfaces:** QEMU further categorizes virtio-gpu device variants based
+on the interface exposed to the guest. The interfaces can be classified
+into VGA and non-VGA variants. The VGA ones are prefixed with virtio-vga
+or vhost-user-vga while the non-VGA ones are prefixed with virtio-gpu or
+vhost-user-gpu.
+
+The VGA ones always use the PCI interface, but for the non-VGA ones, the
+user can further pick between MMIO or PCI. For MMIO, the user can suffix
+the device name with -device, though vhost-user-gpu does not support MMIO.
+For PCI, the user can suffix it with -pci. Without these suffixes, the
+platform default will be chosen.
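+
+The naming rules for the ``virtio-*`` variants can be sketched as a small
+Python function. ``virtio_gpu_device`` is a hypothetical helper, not a QEMU
+API; it only composes names according to the rules above and does not check
+whether a given combination is built into a particular QEMU binary:
+
```python
def virtio_gpu_device(backend=None, interface=None, vga=False):
    """Compose a virtio-gpu device name from the naming rules in the text.

    backend:   None (default 2D), "gl" or "rutabaga"
    interface: None (platform default), "mmio" or "pci"
    vga:       True selects the virtio-vga family, which always uses PCI
    """
    if backend not in (None, "gl", "rutabaga"):
        raise ValueError(f"unknown backend: {backend!r}")
    name = "virtio-vga" if vga else "virtio-gpu"
    if backend:
        name += f"-{backend}"
    if vga:
        # VGA variants take no interface suffix.
        if interface not in (None, "pci"):
            raise ValueError("VGA variants always use the PCI interface")
        return name
    suffixes = {None: "", "mmio": "-device", "pci": "-pci"}
    if interface not in suffixes:
        raise ValueError(f"unknown interface: {interface!r}")
    return name + suffixes[interface]


if __name__ == "__main__":
    print("-device", virtio_gpu_device("gl", "pci"))
```
+
+The vhost-user variants are left out on purpose, since they follow their own
+naming.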
+
+virtio-gpu 2d
+-------------
+
+The default 2D backend only performs 2D operations. The guest needs to
+employ a software renderer for 3D graphics.
+
+Typically, the software renderer is provided by `Mesa`_ or `SwiftShader`_.
+Mesa's implementations (LLVMpipe, Lavapipe and virgl below) work out of the box
+on typical modern Linux distributions.
+
+.. parsed-literal::
+ -device virtio-gpu
+
+.. _Mesa: https://www.mesa3d.org/
+.. _SwiftShader: https://github.com/google/swiftshader
+
+virtio-gpu virglrenderer
+------------------------
+
+When using virgl accelerated graphics mode in the guest, OpenGL API calls
+are translated into an intermediate representation (see `Gallium3D`_). The
+intermediate representation is communicated to the host and the
+`virglrenderer`_ library on the host translates the intermediate
+representation back to OpenGL API calls.
+
+.. parsed-literal::
+ -device virtio-gpu-gl
+
+.. _Gallium3D: https://www.freedesktop.org/wiki/Software/gallium/
+.. _virglrenderer: https://gitlab.freedesktop.org/virgl/virglrenderer/
+
+virtio-gpu rutabaga
+-------------------
+
+virtio-gpu can also leverage rutabaga_gfx to provide `gfxstream`_
+rendering and `Wayland display passthrough`_. With the gfxstream rendering
+mode, GLES and Vulkan calls are forwarded to the host with minimal
+modification.
+
+The crosvm book provides directions on how to build a `gfxstream-enabled
+rutabaga`_ and launch a `guest Wayland proxy`_.
+
+This device does require host blob support (``hostmem`` field below). The
+``hostmem`` field specifies the size of virtio-gpu host memory window.
+This is typically between 256M and 8G.
+
+At least one virtio-gpu capability set ("capset") must be specified when
+starting the device. The capsets currently supported are ``gfxstream-vulkan``
+and ``cross-domain`` for Linux guests. For Android guests, the experimental
+``x-gfxstream-gles`` and ``x-gfxstream-composer`` capsets are also supported.
+
+The device will try to auto-detect the wayland socket path if the
+``cross-domain`` capset name is set. The user may optionally specify
+``wayland-socket-path`` for non-standard paths.
+
+The ``wsi`` option can be set to ``surfaceless`` or ``headless``.
+Surfaceless doesn't create a native window surface, but does copy from the
+render target to the Pixman buffer if a virtio-gpu 2D hypercall is issued.
+Headless is like surfaceless, but doesn't copy to the Pixman buffer.
+Surfaceless is the default if ``wsi`` is not specified.
+
+.. parsed-literal::
+ -device virtio-gpu-rutabaga,gfxstream-vulkan=on,cross-domain=on,
+ hostmem=8G,wayland-socket-path=/tmp/nonstandard/mock_wayland.sock,
+ wsi=headless
+
+.. _gfxstream: https://android.googlesource.com/platform/hardware/google/gfxstream/
+.. _Wayland display passthrough: https://www.youtube.com/watch?v=OZJiHMtIQ2M
+.. _gfxstream-enabled rutabaga: https://crosvm.dev/book/appendix/rutabaga_gfx.html
+.. _guest Wayland proxy: https://crosvm.dev/book/devices/wayland.html
diff --git a/docs/system/devices/virtio-snd.rst b/docs/system/devices/virtio-snd.rst
new file mode 100644
index 0000000000..2a9187fd70
--- /dev/null
+++ b/docs/system/devices/virtio-snd.rst
@@ -0,0 +1,49 @@
+virtio sound
+============
+
+This document explains the setup and usage of the Virtio sound device.
+The Virtio sound device is a paravirtualized sound card device.
+
+Linux kernel support
+--------------------
+
+Virtio sound requires a guest Linux kernel built with the
+``CONFIG_SND_VIRTIO`` option.
+
+Description
+-----------
+
+Virtio sound implements capture and playback from inside a guest using the
+configured audio backend of the host machine.
+
+Device properties
+-----------------
+
+The Virtio sound device can be configured with the following properties:
+
+ * ``jacks`` number of physical jacks (Unimplemented).
+ * ``streams`` number of PCM streams. At the moment, no stream configuration is supported: the first one will always be a playback stream, an optional second will always be a capture stream. Adding more will cycle stream directions from playback to capture.
+ * ``chmaps`` number of channel maps (Unimplemented).
+
+All streams are stereo and have the default channel positions ``Front left, right``.
+
+Examples
+--------
+
+Add an audio device and an audio backend at once with ``-audio`` and ``model=virtio``:
+
+ * pulseaudio: ``-audio driver=pa,model=virtio``
+ or ``-audio driver=pa,model=virtio,server=/run/user/1000/pulse/native``
+ * sdl: ``-audio driver=sdl,model=virtio``
+ * coreaudio: ``-audio driver=coreaudio,model=virtio``
+
+etc.
+
+To specifically add virtualized sound devices, you have to specify a PCI device
+and an audio backend listed with ``-audio driver=help`` that works on your host
+machine, e.g.:
+
+::
+
+ -device virtio-sound-pci,audiodev=my_audiodev \
+ -audiodev alsa,id=my_audiodev
diff --git a/docs/system/gdb.rst b/docs/system/gdb.rst
index bdb42dae2f..4228cb56bb 100644
--- a/docs/system/gdb.rst
+++ b/docs/system/gdb.rst
@@ -46,6 +46,39 @@ Here are some useful tips in order to use gdb on system code:
3. Use ``set architecture i8086`` to dump 16 bit code. Then use
``x/10i $cs*16+$eip`` to dump the code at the PC position.
+Breakpoint and Watchpoint support
+=================================
+
+While GDB can always fall back to inserting breakpoints into memory
+(if writable) other features are very much dependent on support of the
+accelerator. For TCG system emulation we advertise an infinite number
+of hardware assisted breakpoints and watchpoints. For other
+accelerators it will depend on whether support has been added (see
+supports_guest_debug and related hooks in AccelOpsClass).
+
+As TCG cannot track all memory accesses in user-mode there is no
+support for watchpoints.
+
+Relocating code
+===============
+
+On modern kernels confusion can be caused by code being relocated by
+features such as address space layout randomisation. To avoid
+confusion when debugging such things you either need to update gdb's
+view of where things are in memory or perhaps more trivially disable
+ASLR when booting the system.
+
+Debugging user-space in system emulation
+========================================
+
+While it is technically possible to debug a user-space program running
+inside a system image, it does present challenges. Kernel preemption
+and execution mode changes between kernel and user mode can make it
+hard to follow what's going on. Unless you are specifically trying to
+debug some interaction between kernel and user-space you are better
+off running your guest program with gdb either in the guest or using
+a gdbserver exposed via a port to the outside world.
+
Debugging multicore machines
============================
@@ -56,7 +89,7 @@ machine has more than one CPU, QEMU exposes each CPU cluster as a
separate "inferior", where each CPU within the cluster is a separate
"thread". Most QEMU machine types have identical CPUs, so there is a
single cluster which has all the CPUs in it. A few machine types are
-heterogenous and have multiple clusters: for example the ``sifive_u``
+heterogeneous and have multiple clusters: for example the ``sifive_u``
machine has a cluster with one E51 core and a second cluster with four
U54 cores. Here the E51 is the only thread in the first inferior, and
the U54 cores are all threads in the second inferior.
@@ -192,3 +225,18 @@ The memory mode can be checked by sending the following command:
``maintenance packet Qqemu.PhyMemMode:0``
This will change it back to normal memory mode.
+
+Security considerations
+=======================
+
+Connecting to the GDB socket allows running arbitrary code inside the guest;
+in case of the TCG emulation, which is not considered a security boundary, this
+also means running arbitrary code on the host. Additionally, when debugging
+qemu-user, it allows directly downloading any file readable by QEMU from the
+host.
+
+The GDB socket is not protected by authentication, authorization or encryption.
+It is therefore a responsibility of the user to make sure that only authorized
+clients can connect to it, e.g., by using a unix socket with proper
+permissions, or by opening a TCP socket only on interfaces that are not
+reachable by potential attackers.
diff --git a/docs/system/guest-loader.rst b/docs/system/guest-loader.rst
index 4320d1183f..304ee5d531 100644
--- a/docs/system/guest-loader.rst
+++ b/docs/system/guest-loader.rst
@@ -14,7 +14,7 @@ The guest loader does two things:
- load blobs (kernels and initial ram disks) into memory
- sets platform FDT data so hypervisors can find and boot them
-This is what is typically done by a boot-loader like grub using it's
+This is what is typically done by a boot-loader like grub using its
multi-boot capability. A typical example would look like:
.. parsed-literal::
@@ -25,9 +25,9 @@ multi-boot capability. A typical example would look like:
-device guest-loader,addr=0x47000000,initrd=rootfs.cpio
In the above example the Xen hypervisor is loaded by the -kernel
-parameter and passed it's boot arguments via -append. The Dom0 guest
+parameter and passed its boot arguments via -append. The Dom0 guest
is loaded into the areas of memory. Each blob will get
-``/chosen/module@<addr>`` entry in the FDT to indicate it's location and
+``/chosen/module@<addr>`` entry in the FDT to indicate its location and
size. Additional information can be passed with by using additional
arguments.
@@ -51,4 +51,4 @@ The full syntax of the guest-loader is::
``bootargs=<args>``
This is an optional field for kernel blobs which will pass command
- like via the `/chosen/module@<addr>/bootargs` node.
+ line via the ``/chosen/module@<addr>/bootargs`` node.
diff --git a/docs/system/i386/amd-memory-encryption.rst b/docs/system/i386/amd-memory-encryption.rst
new file mode 100644
index 0000000000..e9bc142bc1
--- /dev/null
+++ b/docs/system/i386/amd-memory-encryption.rst
@@ -0,0 +1,206 @@
+AMD Secure Encrypted Virtualization (SEV)
+=========================================
+
+Secure Encrypted Virtualization (SEV) is a feature found on AMD processors.
+
+SEV is an extension to the AMD-V architecture which supports running encrypted
+virtual machines (VMs) under the control of KVM. Encrypted VMs have their pages
+(code and data) secured such that only the guest itself has access to the
+unencrypted version. Each encrypted VM is associated with a unique encryption
+key; if its data is accessed by a different entity using a different key the
+encrypted guest's data will be incorrectly decrypted, leading to unintelligible
+data.
+
+Key management for this feature is handled by a separate processor known as the
+AMD secure processor (AMD-SP), which is present in AMD SOCs. Firmware running
+inside the AMD-SP provides commands to support a common VM lifecycle. This
+includes commands for launching, snapshotting, migrating and debugging the
+encrypted guest. These SEV commands can be issued via KVM_MEMORY_ENCRYPT_OP
+ioctls.
+
+Secure Encrypted Virtualization - Encrypted State (SEV-ES) builds on the SEV
+support to additionally protect the guest register state. In order to allow a
+hypervisor to perform functions on behalf of a guest, there is architectural
+support for notifying a guest's operating system when certain types of VMEXITs
+are about to occur. This allows the guest to selectively share information with
+the hypervisor to satisfy the requested function.
+
+Launching
+---------
+
+Boot images (such as bios) must be encrypted before a guest can be booted. The
+``MEMORY_ENCRYPT_OP`` ioctl provides commands to encrypt the images: ``LAUNCH_START``,
+``LAUNCH_UPDATE_DATA``, ``LAUNCH_MEASURE`` and ``LAUNCH_FINISH``. These four commands
+together generate a fresh memory encryption key for the VM, encrypt the boot
+images and provide a measurement that can be used as an attestation of a
+successful launch.
+
+For a SEV-ES guest, the ``LAUNCH_UPDATE_VMSA`` command is also used to encrypt the
+guest register state, or VM save area (VMSA), for all of the guest vCPUs.
+
+``LAUNCH_START`` is called first to create a cryptographic launch context within
+the firmware. To create this context, the guest owner must provide a guest policy,
+its public Diffie-Hellman key (PDH) and session parameters. These inputs
+should be treated as a binary blob and must be passed as-is to the SEV firmware.
+
+The guest policy is passed as plaintext. A hypervisor may choose to read it,
+but should not modify it (any modification of the policy bits will result
+in bad measurement). The guest policy is a 4-byte data structure containing
+several flags that restrict what can be done on a running SEV guest.
+See SEV API Spec ([SEVAPI]_) section 3 and 6.2 for more details.
+
+The guest policy can be provided via the ``policy`` property::
+
+ # ${QEMU} \
+ sev-guest,id=sev0,policy=0x1...\
+
+Setting the "SEV-ES required" policy bit (bit 2) will launch the guest as a
+SEV-ES guest::
+
+ # ${QEMU} \
+ sev-guest,id=sev0,policy=0x5...\
+
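+A policy value can be decoded before it is passed to QEMU. The following
+Python sketch assumes the low flag bits are laid out as in the guest policy
+table of the SEV API Spec ([SEVAPI]_): only bit 2 ("SEV-ES required") is
+confirmed by the text above, the other names are an assumption that should be
+checked against the spec, and both helpers are hypothetical:
+
```python
# Bit positions of the low policy flags. Bit 2 comes from the text above;
# the remaining names are assumed from the SEV API spec policy table.
POLICY_FLAGS = {
    0: "NODBG",
    1: "NOKS",
    2: "ES",
    3: "NOSEND",
    4: "DOMAIN",
    5: "SEV",
}


def decode_policy(policy):
    """Return the names of the flag bits set in a 4-byte guest policy."""
    if not 0 <= policy <= 0xFFFFFFFF:
        raise ValueError("guest policy must fit in 4 bytes")
    return [name for bit, name in sorted(POLICY_FLAGS.items())
            if policy & (1 << bit)]


def is_sev_es(policy):
    """True if the 'SEV-ES required' bit (bit 2) is set."""
    return (policy & 0x4) != 0


if __name__ == "__main__":
    for value in (0x1, 0x5):
        print(hex(value), decode_policy(value), "SEV-ES:", is_sev_es(value))
```
+
+With the two policy values used in the examples above, ``0x1`` decodes to a
+plain SEV guest and ``0x5`` to a SEV-ES guest.
+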
+The guest owner provided DH certificate and session parameters will be used to
+establish a cryptographic session with the guest owner to negotiate keys used
+for the attestation.
+
+The DH certificate and session blob can be provided via the ``dh-cert-file`` and
+``session-file`` properties::
+
+ # ${QEMU} \
+ sev-guest,id=sev0,dh-cert-file=<file1>,session-file=<file2>
+
+``LAUNCH_UPDATE_DATA`` encrypts the memory region using the cryptographic context
+created via the ``LAUNCH_START`` command. If required, this command can be called
+multiple times to encrypt different memory regions. The command also calculates
+the measurement of the memory contents as it encrypts.
+
+``LAUNCH_UPDATE_VMSA`` encrypts all the vCPU VMSAs for a SEV-ES guest using the
+cryptographic context created via the ``LAUNCH_START`` command. The command also
+calculates the measurement of the VMSAs as it encrypts them.
+
+``LAUNCH_MEASURE`` can be used to retrieve the measurement of encrypted memory and,
+for a SEV-ES guest, encrypted VMSAs. This measurement is a signature of the
+memory contents and, for a SEV-ES guest, the VMSA contents, that can be sent
+to the guest owner as an attestation that the memory and VMSAs were encrypted
+correctly by the firmware. The guest owner may wait to provide the guest
+confidential information until it can verify the attestation measurement.
+Since the guest owner knows the initial contents of the guest at boot, the
+attestation measurement can be verified by comparing it to what the guest owner
+expects.
+
+``LAUNCH_FINISH`` finalizes the guest launch and destroys the cryptographic
+context.
+
+See SEV API Spec ([SEVAPI]_) 'Launching a guest' usage flow (Appendix A) for the
+complete flow chart.
+
+To launch a SEV guest::
+
+ # ${QEMU} \
+ -machine ...,confidential-guest-support=sev0 \
+ -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1
+
+To launch a SEV-ES guest::
+
+ # ${QEMU} \
+ -machine ...,confidential-guest-support=sev0 \
+ -object sev-guest,id=sev0,cbitpos=47,reduced-phys-bits=1,policy=0x5
+
+A SEV-ES guest has some restrictions as compared to a SEV guest. Because the
+guest register state is encrypted and cannot be updated by the VMM/hypervisor,
+a SEV-ES guest:
+
+ - Does not support SMM - SMM support requires updating the guest register
+ state.
+ - Does not support reboot - a system reset requires updating the guest register
+ state.
+ - Requires in-kernel irqchip - the burden is placed on the hypervisor to
+ manage booting APs.
+
+Calculating expected guest launch measurement
+---------------------------------------------
+
+In order to verify the guest launch measurement, the guest owner must compute
+it in the exact same way as it is calculated by the AMD-SP. SEV API Spec
+([SEVAPI]_) section 6.5.1 describes the AMD-SP operations:
+
+ GCTX.LD is finalized, producing the hash digest of all plaintext data
+ imported into the guest.
+
+ The launch measurement is calculated as:
+
+ HMAC(0x04 || API_MAJOR || API_MINOR || BUILD || GCTX.POLICY || GCTX.LD || MNONCE; GCTX.TIK)
+
+ where "||" represents concatenation.
+
+The values of API_MAJOR, API_MINOR, BUILD, and GCTX.POLICY can be obtained
+from the ``query-sev`` qmp command.
+
+The value of MNONCE is part of the response of ``query-sev-launch-measure``: it
+is the last 16 bytes of the base64-decoded data field (see SEV API Spec
+([SEVAPI]_) section 6.5.2 Table 52: LAUNCH_MEASURE Measurement Buffer).
+
+The value of GCTX.LD is
+``SHA256(firmware_blob || kernel_hashes_blob || vmsas_blob)``, where:
+
+* ``firmware_blob`` is the content of the entire firmware flash file (for
+ example, ``OVMF.fd``). Note that you must build a stateless firmware file
+ which doesn't use an NVRAM store, because the NVRAM area is not measured, and
+ therefore it is not secure to use a firmware which uses state from an NVRAM
+ store.
+* if kernel is used, and ``kernel-hashes=on``, then ``kernel_hashes_blob`` is
+ the content of PaddedSevHashTable (including the zero padding), which itself
+ includes the hashes of kernel, initrd, and cmdline that are passed to the
+ guest. The PaddedSevHashTable struct is defined in ``target/i386/sev.c``.
+* if SEV-ES is enabled (``policy & 0x4 != 0``), ``vmsas_blob`` is the
+ concatenation of all VMSAs of the guest vcpus. Each VMSA is 4096 bytes long;
+ its content is defined inside Linux kernel code as ``struct vmcb_save_area``,
+ or in AMD APM Volume 2 ([APMVOL2]_) Table B-2: VMCB Layout, State Save Area.
+
+If kernel hashes are not used, or SEV-ES is disabled, use empty blobs for
+``kernel_hashes_blob`` and ``vmsas_blob`` as needed.
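+
+The calculation above can be sketched in Python. This is an illustration
+under stated assumptions, not a reference implementation: the HMAC is taken
+to be HMAC-SHA-256, ``API_MAJOR``, ``API_MINOR`` and ``BUILD`` are encoded as
+one byte each, and ``GCTX.POLICY`` as a 32-bit little-endian value. These
+encodings are not spelled out in this document and should be checked against
+[SEVAPI]_. All function names are hypothetical:
+
```python
import base64
import hashlib
import hmac
import struct


def launch_digest(firmware_blob, kernel_hashes_blob=b"", vmsas_blob=b""):
    """GCTX.LD: SHA-256 over the concatenated blobs (empty if unused)."""
    return hashlib.sha256(
        firmware_blob + kernel_hashes_blob + vmsas_blob).digest()


def split_measure(data_b64):
    """Split the base64 'data' field of query-sev-launch-measure.

    The last 16 bytes are MNONCE; what precedes them is the measurement.
    """
    raw = base64.b64decode(data_b64)
    return raw[:-16], raw[-16:]


def launch_measurement(api_major, api_minor, build, policy, ld, mnonce, tik):
    """HMAC(0x04 || API_MAJOR || API_MINOR || BUILD || POLICY || LD || MNONCE; TIK).

    Field widths and byte order are assumptions, see the lead-in.
    """
    if len(ld) != 32 or len(mnonce) != 16:
        raise ValueError("expected a 32-byte digest and a 16-byte nonce")
    msg = (bytes([0x04, api_major, api_minor, build])
           + struct.pack("<I", policy) + ld + mnonce)
    return hmac.new(tik, msg, hashlib.sha256).digest()
```
+
+The guest owner would compare the result with the measurement reported by
+QEMU using a constant-time comparison such as ``hmac.compare_digest``.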
+
+Debugging
+---------
+
+Since the memory contents of a SEV guest are encrypted, hypervisor access to
+the guest memory will return cipher text. If the guest policy allows debugging,
+then a hypervisor can use the DEBUG_DECRYPT and DEBUG_ENCRYPT commands to access
+the guest memory region for debug purposes. This is not supported in QEMU yet.
+
+Snapshot/Restore
+----------------
+
+TODO
+
+Live Migration
+---------------
+
+TODO
+
+References
+----------
+
+`AMD Memory Encryption whitepaper
+<https://www.amd.com/content/dam/amd/en/documents/epyc-business-docs/white-papers/memory-encryption-white-paper.pdf>`_
+
+.. [SEVAPI] `Secure Encrypted Virtualization API
+ <https://www.amd.com/system/files/TechDocs/55766_SEV-KM_API_Specification.pdf>`_
+
+.. [APMVOL2] `AMD64 Architecture Programmer's Manual Volume 2: System Programming
+ <https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_
+
+KVM Forum slides:
+
+* `AMD’s Virtualization Memory Encryption (2016)
+ <http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf>`_
+* `Extending Secure Encrypted Virtualization With SEV-ES (2018)
+ <https://www.linux-kvm.org/images/9/94/Extending-Secure-Encrypted-Virtualization-with-SEV-ES-Thomas-Lendacky-AMD.pdf>`_
+
+`AMD64 Architecture Programmer's Manual:
+<https://www.amd.com/content/dam/amd/en/documents/processor-tech-docs/programmer-references/24593.pdf>`_
+
+* SME is section 7.10
+* SEV is section 15.34
+* SEV-ES is section 15.35
diff --git a/docs/system/i386/hyperv.rst b/docs/system/i386/hyperv.rst
new file mode 100644
index 0000000000..2505dc4c86
--- /dev/null
+++ b/docs/system/i386/hyperv.rst
@@ -0,0 +1,288 @@
+Hyper-V Enlightenments
+======================
+
+
+Description
+-----------
+
+In some cases when implementing a hardware interface in software is slow, KVM
+implements its own paravirtualized interfaces. This works well for Linux as
+guest support for such features is added simultaneously with the feature itself.
+It may, however, be hard-to-impossible to add support for these interfaces to
+proprietary OSes, namely, Microsoft Windows.
+
+KVM on x86 implements Hyper-V Enlightenments for Windows guests. These features
+make Windows and Hyper-V guests think they're running on top of a Hyper-V
+compatible hypervisor and use Hyper-V specific features.
+
+
+Setup
+-----
+
+No Hyper-V enlightenments are enabled by default by either KVM or QEMU. In
+QEMU, individual enlightenments can be enabled through CPU flags, e.g.:
+
+.. parsed-literal::
+
+ |qemu_system| --enable-kvm --cpu host,hv_relaxed,hv_vpindex,hv_time, ...
+
+Sometimes there are dependencies between enlightenments; QEMU is supposed to
+check that the supplied configuration is sane.
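+
+A configuration can also be checked before it is handed to QEMU. The sketch
+below, in Python, encodes only the ``Requires:`` notes that appear in the
+list of enlightenments in this document. The table is therefore incomplete,
+the helper name is hypothetical, and flag names use the hyphenated spelling
+from that list:
+
```python
# Dependencies taken from the "Requires:" notes in this document only.
HV_REQUIRES = {
    "hv-synic": {"hv-vpindex"},
    "hv-stimer": {"hv-vpindex", "hv-synic", "hv-time"},
    "hv-tlbflush": {"hv-vpindex"},
    "hv-ipi": {"hv-vpindex"},
}


def missing_dependencies(enabled):
    """Map each enabled enlightenment to the requirements it lacks."""
    enabled = set(enabled)
    problems = {}
    for flag in sorted(enabled):
        missing = HV_REQUIRES.get(flag, set()) - enabled
        if missing:
            problems[flag] = sorted(missing)
    return problems


if __name__ == "__main__":
    # hv-stimer is enabled without hv-synic and hv-time.
    print(missing_dependencies(["hv-relaxed", "hv-vpindex", "hv-stimer"]))
```
+
+An empty result means that none of the dependencies known to the sketch are
+violated; QEMU's own check remains authoritative.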
+
+When any set of the Hyper-V enlightenments is enabled, QEMU changes hypervisor
+identification (CPUID 0x40000000..0x4000000A) to Hyper-V. KVM identification
+and features are kept in leaves 0x40000100..0x40000101.
+
+
+Existing enlightenments
+-----------------------
+
+``hv-relaxed``
+ This feature tells the guest OS to disable watchdog timeouts as it is running on a
+ hypervisor. It is known that some Windows versions will do this even when they
+ see the 'hypervisor' CPU flag.
+
+``hv-vapic``
+ Provides so-called VP Assist page MSR to guest allowing it to work with APIC
+ more efficiently. In particular, this enlightenment allows paravirtualized
+ (exit-less) EOI processing.
+
+``hv-spinlocks`` = xxx
+ Enables paravirtualized spinlocks. The parameter indicates how many times
+ spinlock acquisition should be attempted before indicating the situation to the
+ hypervisor. A special value 0xffffffff indicates "never notify".
+
+``hv-vpindex``
+ Provides HV_X64_MSR_VP_INDEX (0x40000002) MSR to the guest which has Virtual
+ processor index information. This enlightenment makes sense in conjunction with
+ hv-synic, hv-stimer and other enlightenments which require the guest to know its
+ Virtual Processor indices (e.g. when VP index needs to be passed in a
+ hypercall).
+
+``hv-runtime``
+ Provides HV_X64_MSR_VP_RUNTIME (0x40000010) MSR to the guest. The MSR keeps the
+ virtual processor run time in 100ns units. This gives guest operating system an
+ idea of how much time was 'stolen' from it (when the virtual CPU was preempted
+ to perform some other work).
+
+``hv-crash``
+ Provides HV_X64_MSR_CRASH_P0..HV_X64_MSR_CRASH_P4 (0x40000100..0x40000104) and
+ HV_X64_MSR_CRASH_CTL (0x40000105) MSRs to the guest. These MSRs are written to
+ by the guest when it crashes; HV_X64_MSR_CRASH_P0..HV_X64_MSR_CRASH_P4 MSRs
+ contain additional crash information. This information is output in the QEMU log
+ and through QAPI.
+ Note: unlike under genuine Hyper-V, a write to HV_X64_MSR_CRASH_CTL causes the guest
+ to shut down. This effectively blocks crash dump generation by Windows.
+
+``hv-time``
+ Enables two Hyper-V-specific clocksources available to the guest: MSR-based
+ Hyper-V clocksource (HV_X64_MSR_TIME_REF_COUNT, 0x40000020) and Reference TSC
+ page (enabled via MSR HV_X64_MSR_REFERENCE_TSC, 0x40000021). Both clocksources
+ are per-guest, Reference TSC page clocksource allows for exit-less time stamp
+ readings. Using this enlightenment leads to significant speedup of all timestamp
+ related operations.
+
+``hv-synic``
+ Enables Hyper-V Synthetic interrupt controller - an extension of a local APIC.
+ When enabled, this enlightenment provides additional communication facilities
+ to the guest: SynIC messages and Events. This is a pre-requisite for
+ implementing VMBus devices (not yet in QEMU). Additionally, this enlightenment
+ is needed to enable Hyper-V synthetic timers. SynIC is controlled through MSRs
+ HV_X64_MSR_SCONTROL..HV_X64_MSR_EOM (0x40000080..0x40000084) and
+ HV_X64_MSR_SINT0..HV_X64_MSR_SINT15 (0x40000090..0x4000009F).
+
+ Requires: ``hv-vpindex``
+
+``hv-stimer``
+ Enables Hyper-V synthetic timers. There are four synthetic timers per virtual
+ CPU controlled through HV_X64_MSR_STIMER0_CONFIG..HV_X64_MSR_STIMER3_COUNT
+ (0x400000B0..0x400000B7) MSRs. These timers can work either in single-shot or
+ periodic mode. It is known that certain Windows versions revert to using HPET
+ (or even RTC when HPET is unavailable) extensively when this enlightenment is
+ not provided; this can lead to significant CPU consumption, even when virtual
+ CPU is idle.
+
+ Requires: ``hv-vpindex``, ``hv-synic``, ``hv-time``
+
+``hv-tlbflush``
+ Enables paravirtualized TLB shoot-down mechanism. On x86 architecture, remote
+ TLB flush procedure requires sending IPIs and waiting for other CPUs to perform
+ local TLB flush. In virtualized environment some virtual CPUs may not even be
+ scheduled at the time of the call and may not require flushing (or, flushing
+ may be postponed until the virtual CPU is scheduled). The hv-tlbflush enlightenment
+ implements TLB shoot-down through the hypervisor, enabling this optimization.
+
+ Requires: ``hv-vpindex``
+
+``hv-ipi``
+ Enables paravirtualized IPI send mechanism. The HvCallSendSyntheticClusterIpi
+ hypercall may target more than 64 virtual CPUs simultaneously; doing the same
+ through APIC requires more than one access (and thus exit to the hypervisor).
+
+ Requires: ``hv-vpindex``
+
+``hv-vendor-id`` = xxx
+ This changes Hyper-V identification in CPUID 0x40000000.EBX-EDX from the default
+ "Microsoft Hv". The parameter should be no longer than 12 characters. According
+ to the specification, guests shouldn't use this information and it is unknown
+ if there is a Windows version which acts differently.
+ Note: hv-vendor-id is not an enlightenment and thus doesn't enable Hyper-V
+ identification when specified without some other enlightenment.
+
+``hv-reset``
+ Provides HV_X64_MSR_RESET (0x40000003) MSR to the guest allowing it to reset
+ itself by writing to it. Even when this MSR is enabled, it is not a recommended
+ way for Windows to perform system reboot and thus it may not be used.
+
+``hv-frequencies``
+ Provides HV_X64_MSR_TSC_FREQUENCY (0x40000022) and HV_X64_MSR_APIC_FREQUENCY
+ (0x40000023) allowing the guest to get its TSC/APIC frequencies without doing
+ measurements.
+
+``hv-reenlightenment``
+ The enlightenment is nested specific, it targets Hyper-V on KVM guests. When
+ enabled, it provides HV_X64_MSR_REENLIGHTENMENT_CONTROL (0x40000106),
+ HV_X64_MSR_TSC_EMULATION_CONTROL (0x40000107) and HV_X64_MSR_TSC_EMULATION_STATUS
+ (0x40000108) MSRs allowing the guest to get notified when TSC frequency changes
+ (only happens on migration) and keep using old frequency (through emulation in
+ the hypervisor) until it is ready to switch to the new one. This, in conjunction
+ with ``hv-frequencies``, allows Hyper-V on KVM to pass stable clocksource
+ (Reference TSC page) to its own guests.
+
+ Note, KVM doesn't fully support re-enlightenment notifications and doesn't
+ emulate TSC accesses after migration so 'tsc-frequency=' CPU option also has to
+ be specified to make migration succeed. The destination host has to either have
+ the same TSC frequency or support TSC scaling CPU feature.
+
+ Recommended: ``hv-frequencies``
+
+``hv-evmcs``
+ The enlightenment is nested specific, it targets Hyper-V on KVM guests. When
+ enabled, it provides Enlightened VMCS version 1 feature to the guest. The feature
+ implements paravirtualized protocol between L0 (KVM) and L1 (Hyper-V)
+ hypervisors making L2 exits to the hypervisor faster. The feature is Intel-only.
+
+ Note: some virtualization features (e.g. Posted Interrupts) are disabled when
+ hv-evmcs is enabled. It may make sense to measure your nested workload with and
+ without the feature to find out if enabling it is beneficial.
+
+ Requires: ``hv-vapic``
+
+``hv-stimer-direct``
+ Hyper-V specification allows synthetic timer operation in two modes: "classic",
+ when expiration event is delivered as SynIC message and "direct", when the event
+ is delivered via normal interrupt. It is known that nested Hyper-V can only
+ use synthetic timers in direct mode and thus ``hv-stimer-direct`` needs to be
+ enabled.
+
+ Requires: ``hv-vpindex``, ``hv-synic``, ``hv-time``, ``hv-stimer``
+
+``hv-avic`` (``hv-apicv``)
+ The enlightenment allows Hyper-V SynIC to be used with hardware APICv/AVIC enabled.
+ Normally, Hyper-V SynIC disables these hardware features and suggests that the guest
+ use the paravirtualized AutoEOI feature.
+ Note: enabling this feature on old hardware (without APICv/AVIC support) may
+ have negative effect on guest's performance.
+
+``hv-no-nonarch-coresharing`` = on/off/auto
+ This enlightenment tells guest OS that virtual processors will never share a
+ physical core unless they are reported as sibling SMT threads. This information
+ is required by Windows and Hyper-V guests to properly mitigate SMT related CPU
+ vulnerabilities.
+
+ When the option is set to 'auto', QEMU will enable the feature only when KVM
+ reports that non-architectural coresharing is impossible; this means that
+ hyper-threading is not supported or completely disabled on the host. This
+ setting also prevents migration as SMT settings on the destination may differ.
+ When the option is set to 'on' QEMU will always enable the feature, regardless
+ of host setup. To keep guests secure, this can only be used in conjunction with
+ exposing correct vCPU topology and vCPU pinning.
+
+``hv-version-id-build``, ``hv-version-id-major``, ``hv-version-id-minor``, ``hv-version-id-spack``, ``hv-version-id-sbranch``, ``hv-version-id-snumber``
+ This changes Hyper-V version identification in CPUID 0x40000002.EAX-EDX from the
+ default (WS2016).
+
+ - ``hv-version-id-build`` sets 'Build Number' (32 bits)
+ - ``hv-version-id-major`` sets 'Major Version' (16 bits)
+ - ``hv-version-id-minor`` sets 'Minor Version' (16 bits)
+ - ``hv-version-id-spack`` sets 'Service Pack' (32 bits)
+ - ``hv-version-id-sbranch`` sets 'Service Branch' (8 bits)
+ - ``hv-version-id-snumber`` sets 'Service Number' (24 bits)
+
+ Note: hv-version-id-* are not enlightenments and thus don't enable Hyper-V
+ identification when specified without any other enlightenments.
+
+``hv-syndbg``
+ Enables the Hyper-V synthetic debugger interface. This is a special interface used
+ by the Windows kernel debugger to send packets through, rather than sending
+ them via serial/network.
+ When enabled, this enlightenment provides additional communication facilities
+ to the guest: SynDbg messages.
+ This communication channel offers a significant performance boost over the
+ serial/network channels.
+ This enlightenment requires a VMBus device (-device vmbus-bridge,irq=15).
+
+ Requires: ``hv-relaxed``, ``hv-time``, ``hv-vapic``, ``hv-vpindex``, ``hv-synic``, ``hv-runtime``, ``hv-stimer``
+
+``hv-emsr-bitmap``
+ The enlightenment is nested specific, it targets Hyper-V on KVM guests. When
+ enabled, it allows L0 (KVM) and L1 (Hyper-V) hypervisors to collaborate to
+ avoid unnecessary updates to L2 MSR-Bitmap upon vmexits. While the protocol is
+ supported for both VMX (Intel) and SVM (AMD), the VMX implementation requires
+ Enlightened VMCS (``hv-evmcs``) feature to also be enabled.
+
+ Recommended: ``hv-evmcs`` (Intel)
+
+``hv-xmm-input``
+ The Hyper-V specification allows parameters for certain hypercalls to be passed in XMM
+ registers ("XMM Fast Hypercall Input"). When the feature is in use, it allows
+ for faster hypercall processing as KVM can avoid reading guest's memory.
+
+``hv-tlbflush-ext``
+ Allow for extended GVA ranges to be passed to Hyper-V TLB flush hypercalls
+ (HvFlushVirtualAddressList/HvFlushVirtualAddressListEx).
+
+ Requires: ``hv-tlbflush``
+
+``hv-tlbflush-direct``
+ The enlightenment is nested specific, it targets Hyper-V on KVM guests. When
+ enabled, it allows L0 (KVM) to directly handle TLB flush hypercalls from L2
+ guest without the need to exit to L1 (Hyper-V) hypervisor. While the feature is
+ supported for both VMX (Intel) and SVM (AMD), the VMX implementation requires
+ Enlightened VMCS (``hv-evmcs``) feature to also be enabled.
+
+ Requires: ``hv-vapic``
+
+ Recommended: ``hv-evmcs`` (Intel)
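+
+The ``Requires:`` lines above form a small dependency graph. As a rough
+illustration only (this is not QEMU code; the ``REQUIRES`` table and the
+``missing_prerequisites`` helper are hypothetical names that merely transcribe
+the requirements listed in this part of the document), the following Python
+sketch reports which prerequisites a chosen set of flags still lacks:

```python
# Direct prerequisites, transcribed from the "Requires:" lines in this section.
# "Recommended:" entries are deliberately left out.
REQUIRES = {
    "hv-synic": {"hv-vpindex"},
    "hv-stimer": {"hv-vpindex", "hv-synic", "hv-time"},
    "hv-tlbflush": {"hv-vpindex"},
    "hv-ipi": {"hv-vpindex"},
    "hv-evmcs": {"hv-vapic"},
    "hv-stimer-direct": {"hv-vpindex", "hv-synic", "hv-time", "hv-stimer"},
    "hv-syndbg": {"hv-relaxed", "hv-time", "hv-vapic", "hv-vpindex",
                  "hv-synic", "hv-runtime", "hv-stimer"},
    "hv-tlbflush-ext": {"hv-tlbflush"},
    "hv-tlbflush-direct": {"hv-vapic"},
}


def missing_prerequisites(enabled):
    """Return the sorted list of flags that `enabled` needs but lacks."""
    enabled = set(enabled)
    needed = set()
    todo = list(enabled)
    while todo:
        flag = todo.pop()
        # Follow requirements transitively (e.g. ext -> tlbflush -> vpindex).
        for dep in REQUIRES.get(flag, ()):
            if dep not in needed:
                needed.add(dep)
                todo.append(dep)
    return sorted(needed - enabled)


if __name__ == "__main__":
    print(missing_prerequisites({"hv-stimer-direct", "hv-vpindex"}))
```

+Flags documented outside this part of the list may have requirements of their
+own that the table does not capture.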
+
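+The identification options ``hv-vendor-id`` and ``hv-version-id-*`` described
+above only change what the guest reads from CPUID. The Python sketch below
+shows how such values could be packed into registers. It assumes the register
+layout given in the Hyper-V TLFS (vendor string bytes in EBX, ECX, EDX order;
+major version in the upper half of EBX; service branch in the top 8 bits of
+EDX). The helper names are hypothetical, and padding short vendor strings with
+NUL bytes is an assumption rather than documented QEMU behavior:

```python
import struct


def vendor_id_regs(vendor="Microsoft Hv"):
    """Pack a vendor string of at most 12 characters into (EBX, ECX, EDX)."""
    raw = vendor.encode("ascii")
    if len(raw) > 12:
        raise ValueError("hv-vendor-id must be no longer than 12 characters")
    # Assumption: shorter strings are padded with NUL bytes.
    return struct.unpack("<3I", raw.ljust(12, b"\0"))


def version_id_regs(build, major, minor, spack, sbranch, snumber):
    """Pack the hv-version-id-* values into (EAX, EBX, ECX, EDX)."""
    fields = (("build", build, 32), ("major", major, 16),
              ("minor", minor, 16), ("spack", spack, 32),
              ("sbranch", sbranch, 8), ("snumber", snumber, 24))
    for name, value, bits in fields:
        # Each field must fit in the width listed in the text above.
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name} does not fit in {bits} bits")
    return (build, (major << 16) | minor, spack, (sbranch << 24) | snumber)


if __name__ == "__main__":
    print([hex(r) for r in vendor_id_regs()])
    # Illustrative values only: build 14393, version 10.0.
    print([hex(r) for r in version_id_regs(14393, 10, 0, 0, 0, 0)])
```

+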
+Supplementary features
+----------------------
+
+``hv-passthrough``
+ In some cases (e.g. during development) it may make sense to use QEMU in
+ 'pass-through' mode and give Windows guests all enlightenments currently
+ supported by KVM. This pass-through mode is enabled by "hv-passthrough" CPU
+ flag.
+
+ Note: ``hv-passthrough`` flag only enables enlightenments which are known to QEMU
+ (have corresponding 'hv-' flag) and copies ``hv-spinlocks`` and ``hv-vendor-id``
+ values from KVM to QEMU. ``hv-passthrough`` overrides all other 'hv-' settings on
+ the command line. Also, enabling this flag effectively prevents migration as the
+ list of enabled enlightenments may differ between source and destination hosts.
+
+``hv-enforce-cpuid``
+ By default, KVM allows the guest to use all currently supported Hyper-V
+ enlightenments once the Hyper-V CPUID interface is exposed, even if
+ some features were not announced in guest visible CPUIDs. The ``hv-enforce-cpuid``
+ feature alters this behavior and only allows the guest to use exposed Hyper-V
+ enlightenments.
+
+
+Useful links
+------------
+Hyper-V Top Level Functional specification and other information:
+
+- https://github.com/MicrosoftDocs/Virtualization-Documentation
+- https://docs.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/tlfs
+
diff --git a/docs/system/i386/kvm-pv.rst b/docs/system/i386/kvm-pv.rst
new file mode 100644
index 0000000000..1e5a9923ef
--- /dev/null
+++ b/docs/system/i386/kvm-pv.rst
@@ -0,0 +1,100 @@
+Paravirtualized KVM features
+============================
+
+Description
+-----------
+
+In some cases when implementing hardware interfaces in software is slow, ``KVM``
+implements its own paravirtualized interfaces.
+
+Setup
+-----
+
+Paravirtualized ``KVM`` features are represented as CPU flags. The following
+features are enabled by default for any CPU model when ``KVM`` acceleration is
+enabled:
+
+- ``kvmclock``
+- ``kvm-nopiodelay``
+- ``kvm-asyncpf``
+- ``kvm-steal-time``
+- ``kvm-pv-eoi``
+- ``kvmclock-stable-bit``
+
+``kvm-msi-ext-dest-id`` feature is enabled by default in x2apic mode with split
+irqchip (e.g. "-machine ...,kernel-irqchip=split -cpu ...,x2apic").
+
+Note: when CPU model ``host`` is used, QEMU passes through all supported
+paravirtualized ``KVM`` features to the guest.
+
+Existing features
+-----------------
+
+``kvmclock``
+ Expose a ``KVM`` specific paravirtualized clocksource to the guest. Supported
+ since Linux v2.6.26.
+
+``kvm-nopiodelay``
+ The guest doesn't need to perform delays on PIO operations. Supported since
+ Linux v2.6.26.
+
+``kvm-mmu``
+ This feature is deprecated.
+
+``kvm-asyncpf``
+ Enable asynchronous page fault mechanism. Supported since Linux v2.6.38.
+ Note: since Linux v5.10 the feature is deprecated and not enabled by ``KVM``.
+ Use ``kvm-asyncpf-int`` instead.
+
+``kvm-steal-time``
+ Enable stolen (when guest vCPU is not running) time accounting. Supported
+ since Linux v3.1.
+
+``kvm-pv-eoi``
+ Enable paravirtualized end-of-interrupt signaling. Supported since Linux
+ v3.10.
+
+``kvm-pv-unhalt``
+ Enable paravirtualized spinlocks support. Supported since Linux v3.12.
+
+``kvm-pv-tlb-flush``
+ Enable paravirtualized TLB flush mechanism. Supported since Linux v4.16.
+
+``kvm-pv-ipi``
+ Enable paravirtualized IPI mechanism. Supported since Linux v4.19.
+
+``kvm-poll-control``
+ Enable host-side polling on HLT control from the guest. Supported since Linux
+ v5.10.
+
+``kvm-pv-sched-yield``
+ Enable paravirtualized sched yield feature. Supported since Linux v5.10.
+
+``kvm-asyncpf-int``
+ Enable interrupt based asynchronous page fault mechanism. Supported since Linux
+ v5.10.
+
+``kvm-msi-ext-dest-id``
+ Support 'Extended Destination ID' for external interrupts. The feature allows
+ the use of up to 32768 CPUs without IRQ remapping (but other limits may apply making
+ the number of supported vCPUs for a given configuration lower). Supported since
+ Linux v5.10.
+
+``kvmclock-stable-bit``
+ Tell the guest that guest visible TSC value can be fully trusted for kvmclock
+ computations and no warps are expected. Supported since Linux v2.6.35.
+
+Supplementary features
+----------------------
+
+``kvm-pv-enforce-cpuid``
+ Limit the supported paravirtualized feature set to the exposed features only.
+ Note, by default, ``KVM`` allows the guest to use all currently supported
+ paravirtualized features even when they were not announced in guest visible
+ CPUIDs. Supported since Linux v5.10.
+
+
+Useful links
+------------
+
+Please refer to Documentation/virt/kvm in Linux for additional details.
diff --git a/docs/system/i386/sgx.rst b/docs/system/i386/sgx.rst
new file mode 100644
index 0000000000..ab58b29392
--- /dev/null
+++ b/docs/system/i386/sgx.rst
@@ -0,0 +1,188 @@
+Software Guard eXtensions (SGX)
+===============================
+
+Overview
+--------
+
+Intel Software Guard eXtensions (SGX) is a set of instructions and memory access
+mechanisms that provide secure access for sensitive applications and data.
+SGX allows an application to use a particular part of its address space as an
+*enclave*, a protected area that provides confidentiality and integrity even
+in the presence of privileged malware. Accesses to the enclave memory area
+from any software not resident in the enclave are prevented, including those
+from privileged software.
+
+Virtual SGX
+-----------
+
+The SGX feature is exposed to the guest via SGX CPUID. For most SGX CPUID leaves,
+the guest can be given the same CPUID information as the host. With the same
+CPUID information, the guest is able to use the full capacity of SGX, and KVM
+doesn't need to emulate that information.
+
+The guest's EPC base and size are determined by QEMU, and KVM needs QEMU to
+notify it of this information before it can initialize SGX for the guest.
+
+Virtual EPC
+~~~~~~~~~~~
+
+By default, QEMU does not assign EPC to a VM, i.e. fully enabling SGX in a VM
+requires explicit allocation of EPC to the VM. Similar to other specialized
+memory types, e.g. hugetlbfs, EPC is exposed as a memory backend.
+
+SGX EPC is enumerated through CPUID, i.e. EPC "devices" need to be realized
+prior to realizing the vCPUs themselves, which occurs long before generic
+devices are parsed and realized. This limitation means that EPC does not
+require -maxmem as EPC is not treated as {cold,hot}plugged memory.
+
+QEMU does not artificially restrict the number of EPC sections exposed to a
+guest, e.g. QEMU will happily allow you to create 64 1M EPC sections. Be aware
+that some kernels may not recognize all EPC sections, e.g. the Linux SGX driver
+is hardwired to support only 8 EPC sections.
+
+The following QEMU snippet creates two EPC sections, with 64M pre-allocated
+to the VM and an additional 28M mapped but not allocated::
+
+ -object memory-backend-epc,id=mem1,size=64M,prealloc=on \
+ -object memory-backend-epc,id=mem2,size=28M \
+ -M sgx-epc.0.memdev=mem1,sgx-epc.1.memdev=mem2
+
+Note:
+
+The size and location of the virtual EPC are far less restricted compared
+to physical EPC. Because physical EPC is protected via range registers,
+the size of the physical EPC must be a power of two (though software sees
+a subset of the full EPC, e.g. 92M or 128M) and the EPC must be naturally
+aligned. KVM SGX's virtual EPC is purely a software construct and only
+requires the size and location to be page aligned. QEMU enforces that the EPC
+size is a multiple of 4k and will ensure the base of the EPC is 4k aligned.
+To simplify the implementation, EPC is always located above 4g in the guest
+physical address space.
+
+Migration
+~~~~~~~~~
+
+QEMU/KVM doesn't prevent live migration of SGX VMs, although from the hardware's
+perspective SGX doesn't support live migration, since both the EPC and the SGX
+key hierarchy are bound to the physical platform. However, live migration
+can work if the guest software stack can recreate enclaves after a sudden
+loss of EPC, and if guest enclaves can detect that the SGX keys have changed
+and handle this gracefully. For instance, when ERESUME fails
+with #PF.SGX, guest software can detect it and recreate enclaves;
+and when an enclave fails to unseal sensitive information from outside, it can
+detect the error and the sensitive information can be provisioned to it again.
+
+CPUID
+~~~~~
+
+Due to its myriad dependencies, SGX is currently not listed as supported
+in any of QEMU's built-in CPU configurations. To expose SGX (and SGX Launch
+Control) to a guest, you must either use ``-cpu host`` to pass-through the
+host CPU model, or explicitly enable SGX when using a built-in CPU model,
+e.g. via ``-cpu <model>,+sgx`` or ``-cpu <model>,+sgx,+sgxlc``.
+
+All SGX sub-features enumerated through CPUID, e.g. SGX2, MISCSELECT,
+ATTRIBUTES, etc., can be restricted via CPUID flags. Be aware that enforcing
+restriction of MISCSELECT, ATTRIBUTES and XFRM requires intercepting ECREATE,
+i.e. may marginally reduce SGX performance in the guest. All SGX sub-features
+controlled via -cpu are prefixed with "sgx", e.g.::
+
+ $ qemu-system-x86_64 -cpu help | xargs printf "%s\n" | grep sgx
+ sgx
+ sgx-debug
+ sgx-encls-c
+ sgx-enclv
+ sgx-exinfo
+ sgx-kss
+ sgx-mode64
+ sgx-provisionkey
+ sgx-tokenkey
+ sgx1
+ sgx2
+ sgxlc
+
+The following QEMU snippet passes through the host CPU but restricts access to
+the provision and EINIT token keys::
+
+ -cpu host,-sgx-provisionkey,-sgx-tokenkey
+
+SGX sub-features cannot be emulated, i.e. sub-features that are not present
+in hardware cannot be forced on via '-cpu'.
+
+Virtualize SGX Launch Control
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU SGX support for Launch Control (LC) is passive, in the sense that it
+does not actively change the LC configuration. QEMU SGX provides the user
+the ability to set/clear the CPUID flag (and by extension the associated
+IA32_FEATURE_CONTROL MSR bit in fw_cfg) and saves/restores the LE Hash MSRs
+when getting/putting guest state, but QEMU does not add new controls to
+directly modify the LC configuration. Similar to hardware behavior, locking
+the LC configuration to a non-Intel value is left to guest firmware. Unlike
+the host BIOS setting for SGX Launch Control (LC), there is by design no
+special BIOS setting for an SGX guest. If the host is in locked mode, creating
+a VM with SGX is still allowed.
+
+Feature Control
+~~~~~~~~~~~~~~~
+
+QEMU SGX updates the ``etc/msr_feature_control`` fw_cfg entry to set the SGX
+(bit 18) and SGX LC (bit 17) flags based on their respective CPUID support,
+i.e. existing guest firmware will automatically set SGX and SGX LC accordingly,
+assuming said firmware supports fw_cfg.msr_feature_control.
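+
+As a small worked example of the bit positions quoted above, the Python sketch
+below computes the SGX-related part of the feature control value for a given
+CPUID configuration. The constant and function names are hypothetical, and any
+other bits of the MSR (which this section does not describe) are not modelled:

```python
# Bit positions quoted in the text above.
FEATURE_CONTROL_SGX_LC = 1 << 17
FEATURE_CONTROL_SGX = 1 << 18


def sgx_feature_control_bits(has_sgx, has_sgxlc):
    """Return the SGX-related feature control bits for the given CPUID support."""
    value = 0
    if has_sgx:
        value |= FEATURE_CONTROL_SGX
    if has_sgxlc:
        value |= FEATURE_CONTROL_SGX_LC
    return value


if __name__ == "__main__":
    # e.g. a guest started with "-cpu <model>,+sgx,+sgxlc"
    print(hex(sgx_feature_control_bits(True, True)))
```
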
+
+Launching a guest
+-----------------
+
+To launch an SGX guest:
+
+.. parsed-literal::
+
+ |qemu_system_x86| \\
+ -cpu host,+sgx-provisionkey \\
+ -object memory-backend-epc,id=mem1,size=64M,prealloc=on \\
+ -M sgx-epc.0.memdev=mem1,sgx-epc.0.node=0
+
+Utilizing SGX in the guest requires a kernel/OS with SGX support.
+Support can be checked from within the guest by::
+
+ $ grep sgx /proc/cpuinfo
+
+and the SGX EPC information by::
+
+ $ dmesg | grep sgx
+ [ 0.182807] sgx: EPC section 0x140000000-0x143ffffff
+ [ 0.183695] sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0.
+
+To launch an SGX NUMA guest:
+
+.. parsed-literal::
+
+ |qemu_system_x86| \\
+ -cpu host,+sgx-provisionkey \\
+ -object memory-backend-ram,size=2G,host-nodes=0,policy=bind,id=node0 \\
+ -object memory-backend-epc,id=mem0,size=64M,prealloc=on,host-nodes=0,policy=bind \\
+ -numa node,nodeid=0,cpus=0-1,memdev=node0 \\
+ -object memory-backend-ram,size=2G,host-nodes=1,policy=bind,id=node1 \\
+ -object memory-backend-epc,id=mem1,size=28M,prealloc=on,host-nodes=1,policy=bind \\
+ -numa node,nodeid=1,cpus=2-3,memdev=node1 \\
+ -M sgx-epc.0.memdev=mem0,sgx-epc.0.node=0,sgx-epc.1.memdev=mem1,sgx-epc.1.node=1
+
+and the SGX EPC NUMA information by::
+
+ $ dmesg | grep sgx
+ [ 0.369937] sgx: EPC section 0x180000000-0x183ffffff
+ [ 0.370259] sgx: EPC section 0x184000000-0x185bfffff
+
+ $ dmesg | grep SRAT
+ [ 0.009981] ACPI: SRAT: Node 0 PXM 0 [mem 0x180000000-0x183ffffff]
+ [ 0.009982] ACPI: SRAT: Node 1 PXM 1 [mem 0x184000000-0x185bfffff]
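+
+The EPC ranges in the ``dmesg`` output above can be checked against the sizes
+passed to ``memory-backend-epc``. The Python sketch below is an illustrative
+helper (``epc_section_sizes`` is a hypothetical name, and it assumes the logged
+end address is inclusive, as the ``...ffffff`` values suggest) that converts
+each logged range into a size:

```python
import re

EPC_LINE = re.compile(r"sgx: EPC section 0x([0-9a-f]+)-0x([0-9a-f]+)")


def epc_section_sizes(dmesg_text):
    """Return the size in bytes of each EPC section found in dmesg output."""
    sizes = []
    for start, end in EPC_LINE.findall(dmesg_text):
        # Assumption: the logged end address is inclusive.
        sizes.append(int(end, 16) - int(start, 16) + 1)
    return sizes


if __name__ == "__main__":
    sample = (
        "[ 0.369937] sgx: EPC section 0x180000000-0x183ffffff\n"
        "[ 0.370259] sgx: EPC section 0x184000000-0x185bfffff\n"
    )
    for size in epc_section_sizes(sample):
        print(f"{size // (1 << 20)}M, 4k multiple: {size % 4096 == 0}")
```

+For the NUMA example this yields 64M and 28M, matching the ``size=`` values of
+``mem0`` and ``mem1``.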
+
+References
+----------
+
+- `SGX Homepage <https://software.intel.com/sgx>`__
+
+- `SGX SDK <https://github.com/intel/linux-sgx.git>`__
+
+- SGX specification: Intel SDM Volume 3
diff --git a/docs/system/i386/xen.rst b/docs/system/i386/xen.rst
new file mode 100644
index 0000000000..46db5f34c1
--- /dev/null
+++ b/docs/system/i386/xen.rst
@@ -0,0 +1,144 @@
+Xen HVM guest support
+=====================
+
+
+Description
+-----------
+
+KVM has support for hosting Xen guests, intercepting Xen hypercalls and event
+channel (Xen PV interrupt) delivery. This allows guests which expect to be
+run under Xen to be hosted in QEMU under Linux/KVM instead.
+
+Using the split irqchip is mandatory for Xen support.
+
+Setup
+-----
+
+Xen mode is enabled by setting the ``xen-version`` property of the KVM
+accelerator, for example for Xen 4.17:
+
+.. parsed-literal::
+
+ |qemu_system| --accel kvm,xen-version=0x40011,kernel-irqchip=split
+
+Additionally, virtual APIC support can be advertised to the guest through the
+``xen-vapic`` CPU flag:
+
+.. parsed-literal::
+
+ |qemu_system| --accel kvm,xen-version=0x40011,kernel-irqchip=split --cpu host,+xen-vapic
+
+When Xen support is enabled, QEMU changes hypervisor identification (CPUID
+0x40000000..0x4000000A) to Xen. The KVM identification and features are not
+advertised to a Xen guest. If Hyper-V is also enabled, the Xen identification
+moves to leaves 0x40000100..0x4000010A.
+
+Properties
+----------
+
+The following properties exist on the KVM accelerator object:
+
+``xen-version``
+ This property contains the Xen version in ``XENVER_version`` form, with the
+ major version in the top 16 bits and the minor version in the low 16 bits.
+ Setting this property enables the Xen guest support. If Xen version 4.5 or
+ greater is specified, the HVM leaf in Xen CPUID is populated. Xen version
+ 4.6 enables the vCPU ID in CPUID, and version 4.17 advertises vCPU upcall
+ vector support to the guest.
+
+``xen-evtchn-max-pirq``
+ Xen PIRQs represent an emulated physical interrupt, either GSI or MSI, which
+ can be routed to an event channel instead of to the emulated I/O or local
+ APIC. By default, QEMU permits only 256 PIRQs because this allows maximum
+ compatibility with 32-bit MSI where the higher bits of the PIRQ# would need
+ to be in the upper 64 bits of the MSI message. For guests with large numbers
+ of PCI devices (and none which are limited to 32-bit addressing) it may be
+ desirable to increase this value.
+
+``xen-gnttab-max-frames``
+ Xen grant tables are the means by which a Xen guest grants access to its
+ memory for PV back ends (disk, network, etc.). Since QEMU only supports v1
+ grant tables which are 8 bytes in size, each page (each frame) of the grant
+ table can reference 512 pages of guest memory. The default number of frames
+ is 64, allowing for 32768 pages of guest memory to be accessed by PV backends
+ through simultaneous grants. For guests with large numbers of PV devices and
+ high throughput, it may be desirable to increase this value.
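+
+The arithmetic behind ``xen-version`` and ``xen-gnttab-max-frames`` can be
+sketched in Python as follows. The helper names are hypothetical and a 4 KiB
+guest page size is assumed; the sketch only reproduces the numbers quoted
+above (``0x40011`` for Xen 4.17, and 32768 grantable pages for 64 frames):

```python
PAGE_SIZE = 4096          # assumption: 4 KiB guest pages
GRANT_V1_ENTRY_SIZE = 8   # bytes per v1 grant table entry, as stated above


def xen_version(major, minor):
    """Encode a Xen version: major in the top 16 bits, minor in the low 16."""
    if not (0 <= major < (1 << 16) and 0 <= minor < (1 << 16)):
        raise ValueError("major and minor must each fit in 16 bits")
    return (major << 16) | minor


def grantable_pages(frames=64):
    """Number of guest pages that `frames` grant table frames can reference."""
    return frames * (PAGE_SIZE // GRANT_V1_ENTRY_SIZE)


if __name__ == "__main__":
    print(hex(xen_version(4, 17)))
    pages = grantable_pages(64)
    print(pages, "pages =", pages * PAGE_SIZE // (1 << 20), "MiB")
```

+With the default of 64 frames this corresponds to 128 MiB of guest memory that
+can be granted at any one time.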
+
+Xen paravirtual devices
+-----------------------
+
+The Xen PCI platform device is enabled automatically for a Xen guest. This
+allows a guest to unplug all emulated devices, in order to use paravirtual
+block and network drivers instead.
+
+Those paravirtual Xen block, network (and console) devices can be created
+through the command line, and/or hot-plugged.
+
+To provide a Xen console device, define a character device and then a device
+of type ``xen-console`` to connect to it. For the Xen console equivalent of
+the handy ``-serial mon:stdio`` option, for example:
+
+.. parsed-literal::
+ -chardev stdio,mux=on,id=char0,signal=off -mon char0 \\
+ -device xen-console,chardev=char0
+
+The Xen network device is ``xen-net-device``, which becomes the default NIC
+model for emulated Xen guests, meaning that just the default NIC provided
+by QEMU should automatically work and present a Xen network device to the
+guest.
+
+Disks can be configured with '``-drive file=${GUEST_IMAGE},if=xen``' and will
+appear to the guest as ``xvda`` onwards.
+
+Under Xen, the boot disk is typically available both via IDE emulation, and
+as a PV block device. Guest bootloaders typically use IDE to load the guest
+kernel, which then unplugs the IDE and continues with the Xen PV block device.
+
+This configuration can be achieved as follows:
+
+.. parsed-literal::
+
+ |qemu_system| --accel kvm,xen-version=0x40011,kernel-irqchip=split \\
+ -drive file=${GUEST_IMAGE},if=xen \\
+ -drive file=${GUEST_IMAGE},file.locking=off,if=ide
+
+VirtIO devices can also be used; Linux guests may need to be dissuaded from
+unplugging them by adding '``xen_emul_unplug=never``' on their command line.
+
+Booting Xen PV guests
+---------------------
+
+Booting PV guest kernels is possible by using the Xen PV shim (a version of Xen
+itself, designed to run inside a Xen HVM guest and provide memory management
+services for one guest alone).
+
+The Xen binary is provided as the ``-kernel`` and the guest kernel itself (or
+PV Grub image) as the ``-initrd`` image, which actually just means the first
+multiboot "module". For example:
+
+.. parsed-literal::
+
+ |qemu_system| --accel kvm,xen-version=0x40011,kernel-irqchip=split \\
+ -chardev stdio,id=char0 -device xen-console,chardev=char0 \\
+ -display none -m 1G -kernel xen -initrd bzImage \\
+ -append "pv-shim console=xen,pv -- console=hvc0 root=/dev/xvda1" \\
+ -drive file=${GUEST_IMAGE},if=xen
+
+The Xen image must be built with the ``CONFIG_XEN_GUEST`` and ``CONFIG_PV_SHIM``
+options, and as of Xen 4.17, Xen's PV shim mode does not support using a serial
+port; it must have a Xen console or it will panic.
+
+The example above provides the guest kernel command line after a separator
+(" ``--`` ") on the Xen command line, and does not provide the guest kernel
+with an actual initramfs, which would need to be listed as a second multiboot
+module. For more complicated alternatives, see the command line
+:ref:`documentation <system/invocation-qemu-options-initrd>` for the
+``-initrd`` option.
+
+Host OS requirements
+--------------------
+
+The minimal Xen support in the KVM accelerator requires the host to be running
+Linux v5.12 or newer. Later versions add optimisations: Linux v5.17 added
+acceleration of interrupt delivery via the Xen PIRQ mechanism, and Linux v5.19
+accelerated Xen PV timers and inter-processor interrupts (IPIs).
diff --git a/docs/system/images.rst b/docs/system/images.rst
index 3d9144e625..d000bd6b6f 100644
--- a/docs/system/images.rst
+++ b/docs/system/images.rst
@@ -20,7 +20,7 @@ where myimage.img is the disk image filename and mysize is its size in
kilobytes. You can add an ``M`` suffix to give the size in megabytes and
a ``G`` suffix for gigabytes.
-See the qemu-img invocation documentation for more information.
+See the ``qemu-img`` invocation documentation for more information.
.. _disk_005fimages_005fsnapshot_005fmode:
diff --git a/docs/system/index.rst b/docs/system/index.rst
index 73bbedbc22..c21065e519 100644
--- a/docs/system/index.rst
+++ b/docs/system/index.rst
@@ -1,16 +1,18 @@
+.. _System Emulation:
+
----------------
System Emulation
----------------
This section of the manual is the overall guide for users using QEMU
for full system emulation (as opposed to user-mode emulation).
-This includes working with hypervisors such as KVM, Xen, Hax
+This includes working with hypervisors such as KVM, Xen
or Hypervisor.Framework.
.. toctree::
:maxdepth: 3
- quickstart
+ introduction
invocation
device-emulation
keys
@@ -27,6 +29,7 @@ or Hypervisor.Framework.
secrets
authz
gdb
+ replay
managed-startup
bootindex
cpu-hotplug
@@ -34,3 +37,5 @@ or Hypervisor.Framework.
targets
security
multi-process
+ confidential-guest-support
+ vm-templating
diff --git a/docs/system/introduction.rst b/docs/system/introduction.rst
new file mode 100644
index 0000000000..746707eb00
--- /dev/null
+++ b/docs/system/introduction.rst
@@ -0,0 +1,219 @@
+Introduction
+============
+
+.. _Accelerators:
+
+Virtualisation Accelerators
+---------------------------
+
+QEMU's system emulation provides a virtual model of a machine (CPU,
+memory and emulated devices) to run a guest OS. It supports a number
+of hypervisors (known as accelerators) as well as a JIT known as the
+Tiny Code Generator (TCG) capable of emulating many CPUs.
+
+.. list-table:: Supported Accelerators
+ :header-rows: 1
+
+ * - Accelerator
+ - Host OS
+ - Host Architectures
+ * - KVM
+ - Linux
+ - Arm (64 bit only), MIPS, PPC, RISC-V, s390x, x86
+ * - Xen
+ - Linux (as dom0)
+ - Arm, x86
+ * - Hypervisor Framework (hvf)
+ - MacOS
+ - x86 (64 bit only), Arm (64 bit only)
+ * - Windows Hypervisor Platform (whpx)
+ - Windows
+ - x86
+ * - NetBSD Virtual Machine Monitor (nvmm)
+ - NetBSD
+ - x86
+ * - Tiny Code Generator (tcg)
+ - Linux, other POSIX, Windows, MacOS
+ - Arm, x86, Loongarch64, MIPS, PPC, s390x, Sparc64
+
+Feature Overview
+----------------
+
+System emulation provides a wide range of device models to emulate
+various hardware components you may want to add to your machine. This
+includes a wide number of VirtIO devices which are specifically tuned
+for efficient operation under virtualisation. Some of the device
+emulation can be offloaded from the main QEMU process using either
+vhost-user (for VirtIO) or :ref:`Multi-process QEMU`. If the platform
+supports it QEMU also supports directly passing devices through to
+guest VMs to eliminate the device emulation overhead. See
+:ref:`device-emulation` for more details.
+
+There is a full :ref:`featured block layer<Live Block Operations>`
+which allows for construction of complex storage topology which can be
+stacked across multiple layers supporting redirection, networking,
+snapshots and migration support.
+
+The flexible ``chardev`` system allows for handling IO from character
+like devices using stdio, files, unix sockets and TCP networking.
+
+QEMU provides a number of management interfaces including a line based
+:ref:`Human Monitor Protocol (HMP)<QEMU monitor>` that allows you to
+dynamically add and remove devices as well as introspect the system
+state. The :ref:`QEMU Monitor Protocol<QMP Ref>` (QMP) is a well
+defined, versioned, machine usable API that presents a rich interface
+to other tools to create, control and manage Virtual Machines. This is
+the interface used by higher level tools such as `Virt
+Manager <https://virt-manager.org/>`_ using the `libvirt framework
+<https://libvirt.org>`_.
+
+For the common accelerators, QEMU supports debugging with its
+:ref:`gdbstub<GDB usage>` which allows users to connect GDB and debug
+system software images.
+
+Running
+-------
+
+QEMU provides a rich and complex API which can be overwhelming to
+understand. While some architectures can boot something with just a
+disk image, those examples elide a lot of details with defaults that
+may not be optimal for modern systems.
+
+For a non-x86 system where we emulate a broad range of machine types,
+the command lines are generally more explicit in defining the machine
+and boot behaviour. You will often find example command lines in
+the :ref:`system-targets-ref` section of the manual.
+
+While the project doesn't want to discourage users from using the
+command line to launch VMs, we do want to highlight that there are a
+number of projects dedicated to providing a more user friendly
+experience. Those built around the ``libvirt`` framework can make use
+of feature probing to build modern VM images tailored to run on the
+hardware you have.
+
+That said, the general form of a QEMU command line can be expressed
+as:
+
+.. parsed-literal::
+
+ $ |qemu_system| [machine opts] \\
+ [cpu opts] \\
+ [accelerator opts] \\
+ [device opts] \\
+ [backend opts] \\
+ [interface opts] \\
+ [boot opts]
+
+Most options will generate some help information. So for example:
+
+.. parsed-literal::
+
+ $ |qemu_system| -M help
+
+will list the machine types supported by that QEMU binary. ``help``
+can also be passed as an argument to another option. For example:
+
+.. parsed-literal::
+
+ $ |qemu_system| -device scsi-hd,help
+
+will list the arguments and their default values of additional options
+that can control the behaviour of the ``scsi-hd`` device.
+
+.. list-table:: Options Overview
+ :header-rows: 1
+ :widths: 10, 90
+
+ * - Options
+ -
+ * - Machine
+ - Define the machine type, amount of memory etc
+ * - CPU
+ - Type and number/topology of vCPUs. Most accelerators offer
+ a ``host`` cpu option which simply passes through your host CPU
+ configuration without filtering out any features.
+ * - Accelerator
+ - This will depend on the hypervisor you run. Note that the
+ default is TCG, which is purely emulated, so you must specify an
+ accelerator type to take advantage of hardware virtualization.
+ * - Devices
+ - Additional devices that are not defined by default with the
+ machine type.
+ * - Backends
+ - Backends are how QEMU deals with the guest's data, for example
+ how a block device is stored, how network devices see the
+ network or how a serial device is directed to the outside world.
+ * - Interfaces
+ - How the system is displayed, how it is managed and controlled or
+ debugged.
+ * - Boot
+ - How the system boots, via firmware or direct kernel boot.
+
+In the following example we first define a ``virt`` machine which is a
+general purpose platform for running AArch64 guests. We enable
+virtualisation so we can use KVM inside the emulated guest. As the
+``virt`` machine comes with some built in pflash devices we give them
+names so we can override the defaults later.
+
+.. code::
+
+ $ qemu-system-aarch64 \
+ -machine type=virt,virtualization=on,pflash0=rom,pflash1=efivars \
+ -m 4096 \
+
+We then define the 4 vCPUs using the ``max`` option which gives us all
+the Arm features QEMU is capable of emulating. We enable a more
+emulation friendly implementation of Arm's pointer authentication
+algorithm. We explicitly specify TCG acceleration even though QEMU
+would default to it anyway.
+
+.. code::
+
+ -cpu max,pauth-impdef=on \
+ -smp 4 \
+ -accel tcg \
+
+As the ``virt`` platform doesn't have any default network or storage
+devices we need to define them. We give them ids so we can link them
+with the backend later on.
+
+.. code::
+
+ -device virtio-net-pci,netdev=unet \
+ -device virtio-scsi-pci \
+ -device scsi-hd,drive=hd \
+
+We connect the user-mode networking to our network device. As
+user-mode networking isn't directly accessible from the outside world
+we forward localhost port 2222 to the ssh port on the guest.
+
+.. code::
+
+ -netdev user,id=unet,hostfwd=tcp::2222-:22 \
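+
+Once the guest is up and running an SSH server, it can be reached from
+the host through the forwarded port. In this sketch the account name
+``user`` is a placeholder for whatever exists in your guest image:
+
+.. code::
+
+    $ ssh -p 2222 user@localhost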
+
+We connect the guest visible block device to an LVM partition we have
+set aside for our guest.
+
+.. code::
+
+ -blockdev driver=raw,node-name=hd,file.driver=host_device,file.filename=/dev/lvm-disk/debian-bullseye-arm64 \
+
+We then tell QEMU to multiplex the :ref:`QEMU monitor` with the serial
+port output (we can switch between the two using :ref:`keys in the
+character backend multiplexer`). As there is no default graphical
+device we disable the display as we can work entirely in the terminal.
+
+.. code::
+
+ -serial mon:stdio \
+ -display none \
+
+Finally we override the default firmware to ensure we have some
+storage for EFI to persist its configuration. That firmware is
+responsible for finding the disk, booting grub and eventually running
+our system.
+
+.. code::
+
+ -blockdev node-name=rom,driver=file,filename=$(pwd)/pc-bios/edk2-aarch64-code.fd,read-only=true \
+ -blockdev node-name=efivars,driver=file,filename=$HOME/images/qemu-arm64-efivars
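+
+The ``efivars`` file has to exist before QEMU starts. As a minimal
+sketch, assuming the 64 MiB flash banks of the ``virt`` machine and the
+same path as in the command above, an empty variable store could be
+created with:
+
+.. code::
+
+    $ truncate -s 64M $HOME/images/qemu-arm64-efivars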
diff --git a/docs/system/invocation.rst b/docs/system/invocation.rst
index 4ba38fc23d..14b7db1c10 100644
--- a/docs/system/invocation.rst
+++ b/docs/system/invocation.rst
@@ -10,6 +10,11 @@ Invocation
disk_image is a raw hard disk image for IDE hard disk 0. Some targets do
not need a disk image.
+When an option parameter is an arbitrary string that contains commas,
+such as in "file=my,file" and "string=a,b", the commas must be doubled.
+For instance, "-fw_cfg name=z,string=a,,b" will be parsed as
+"-fw_cfg name=z,string=a,b".
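+
+As a further illustration, a disk image whose (hypothetical) file name
+is ``my,file.img`` would be passed as:
+
+.. parsed-literal::
+
+    |qemu_system| -drive "file=my,,file.img,format=raw"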
+
.. hxtool-doc:: qemu-options.hx
Device URL Syntax
diff --git a/docs/system/keys.rst b/docs/system/keys.rst
index e596ae6c4e..0fc17b994d 100644
--- a/docs/system/keys.rst
+++ b/docs/system/keys.rst
@@ -1,4 +1,4 @@
-.. _pcsys_005fkeys:
+.. _GUI_keys:
Keys in the graphical frontends
-------------------------------
diff --git a/docs/system/keys.rst.inc b/docs/system/keys.rst.inc
index bd9b8e5f6f..59966a3fe7 100644
--- a/docs/system/keys.rst.inc
+++ b/docs/system/keys.rst.inc
@@ -1,8 +1,9 @@
-During the graphical emulation, you can use special key combinations to
-change modes. The default key mappings are shown below, but if you use
-``-alt-grab`` then the modifier is Ctrl-Alt-Shift (instead of Ctrl-Alt)
-and if you use ``-ctrl-grab`` then the modifier is the right Ctrl key
-(instead of Ctrl-Alt):
+During the graphical emulation, you can use special key combinations from
+the following table to change modes. By default the modifier is Ctrl-Alt
+(used in the table below), which can be changed with the ``-display``
+suboption ``grab-mod=`` where appropriate. For example,
+``-display sdl,grab-mod=lshift-lctrl-lalt`` changes the modifier key to
+Ctrl-Alt-Shift, while ``-display sdl,grab-mod=rctrl`` changes it to the
+right Ctrl key.
Ctrl-Alt-f
Toggle full screen
@@ -28,7 +29,7 @@ Ctrl-Alt-n
*3*
Serial port
-Ctrl-Alt
+Ctrl-Alt-g
Toggle mouse and keyboard grab.
In the virtual consoles, you can use Ctrl-Up, Ctrl-Down, Ctrl-PageUp and
diff --git a/docs/system/linuxboot.rst b/docs/system/linuxboot.rst
index 228650abc5..5db2e560dc 100644
--- a/docs/system/linuxboot.rst
+++ b/docs/system/linuxboot.rst
@@ -27,4 +27,4 @@ virtual serial port and the QEMU monitor to the console with the
-append "root=/dev/hda console=ttyS0" -nographic
Use Ctrl-a c to switch between the serial console and the monitor (see
-:ref:`pcsys_005fkeys`).
+:ref:`GUI_keys`).
diff --git a/docs/system/loongarch/virt.rst b/docs/system/loongarch/virt.rst
new file mode 100644
index 0000000000..06d034b8ef
--- /dev/null
+++ b/docs/system/loongarch/virt.rst
@@ -0,0 +1,108 @@
+:orphan:
+
+==========================================
+loongson3 virt generic platform (``virt``)
+==========================================
+
+The ``virt`` machine uses a GPEX host bridge, and the board provides
+several emulated devices, such as the loongson7a RTC device, IOAPIC
+device, ACPI device and so on.
+
+Supported devices
+-----------------
+
+The ``virt`` machine supports:
+
+- Gpex host bridge
+- Ls7a RTC device
+- Ls7a IOAPIC device
+- ACPI GED device
+- Fw_cfg device
+- PCI/PCIe devices
+- Memory device
+- CPU device. Type: la464.
+
+CPU and machine Type
+--------------------
+
+The ``qemu-system-loongarch64`` binary provides emulation for the
+``virt`` machine. You can specify the machine type ``virt`` and the
+CPU type ``la464``.
+
+Boot options
+------------
+
+We can boot the LoongArch virt machine by specifying the UEFI BIOS,
+initrd, and Linux kernel. The source code and binary files can be
+obtained with the following steps.
+
+(1) Build qemu-system-loongarch64:
+
+.. code-block:: bash
+
+ ./configure --disable-rdma --prefix=/usr \
+ --target-list="loongarch64-softmmu" \
+ --disable-libiscsi --disable-libnfs --disable-libpmem \
+ --disable-glusterfs --enable-libusb --enable-usb-redir \
+ --disable-opengl --disable-xen --enable-spice \
+ --enable-debug --disable-capstone --disable-kvm \
+ --enable-profiler
+ make -j8
+
+(2) Set cross tools:
+
+.. code-block:: bash
+
+ wget https://github.com/loongson/build-tools/releases/download/2022.09.06/loongarch64-clfs-6.3-cross-tools-gcc-glibc.tar.xz
+
+ tar -vxf loongarch64-clfs-6.3-cross-tools-gcc-glibc.tar.xz -C /opt
+
+ export PATH=/opt/cross-tools/bin:$PATH
+ export LD_LIBRARY_PATH=/opt/cross-tools/lib:$LD_LIBRARY_PATH
+ export LD_LIBRARY_PATH=/opt/cross-tools/loongarch64-unknown-linux-gnu/lib/:$LD_LIBRARY_PATH
+
+Note: You need to get the latest cross-tools from https://github.com/loongson/build-tools
+
+(3) Build BIOS:
+
+ See: https://github.com/tianocore/edk2-platforms/tree/master/Platform/Loongson/LoongArchQemuPkg#readme
+
+Note: To build the release version of the BIOS, set --buildtarget=RELEASE.
+The BIOS file path is Build/LoongArchQemu/RELEASE_GCC5/FV/QEMU_EFI.fd
+
+(4) Build kernel:
+
+.. code-block:: bash
+
+ git clone https://github.com/loongson/linux.git
+
+ cd linux
+
+ git checkout loongarch-next
+
+ make ARCH=loongarch CROSS_COMPILE=loongarch64-unknown-linux-gnu- loongson3_defconfig
+
+ make ARCH=loongarch CROSS_COMPILE=loongarch64-unknown-linux-gnu- -j32
+
+Note: The kernel is built from the ``loongarch-next`` branch. The resulting
+kernel file is arch/loongarch/boot/vmlinuz.efi
+
+(5) Get initrd:
+
+ You can use the busybox tool and the Linux modules to make an initrd file, or you can
+ download the binary files from https://github.com/yangxiaojuan-loongson/qemu-binary
+
+.. code-block:: bash
+
+ git clone https://github.com/yangxiaojuan-loongson/qemu-binary
+
+Note: The initrd file is named ``ramdisk``.
+
+(6) Booting LoongArch:
+
+.. code-block:: bash
+
+ $ ./build/qemu-system-loongarch64 -machine virt -m 4G -cpu la464 \
+ -smp 1 -bios QEMU_EFI.fd -kernel vmlinuz.efi -initrd ramdisk \
+ -serial stdio -monitor telnet:localhost:4495,server,nowait \
+ -append "root=/dev/ram rdinit=/sbin/init console=ttyS0,115200" \
+ --nographic
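+
+With the options above, the QEMU monitor listens on TCP port 4495 of the
+local host rather than sharing the terminal with the serial console. As
+a small sketch, assuming a ``telnet`` client is installed, it can be
+reached from a second terminal with:
+
+.. code-block:: bash
+
+    $ telnet localhost 4495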
diff --git a/docs/system/multi-process.rst b/docs/system/multi-process.rst
index 210531ee17..2008a67809 100644
--- a/docs/system/multi-process.rst
+++ b/docs/system/multi-process.rst
@@ -1,8 +1,10 @@
+.. _Multi-process QEMU:
+
Multi-process QEMU
==================
This document describes how to configure and use multi-process qemu.
-For the design document refer to docs/devel/qemu-multiprocess.
+For the design document refer to docs/devel/multi-process.rst.
1) Configuration
----------------
diff --git a/docs/system/openrisc/cpu-features.rst b/docs/system/openrisc/cpu-features.rst
new file mode 100644
index 0000000000..aeb65e22ff
--- /dev/null
+++ b/docs/system/openrisc/cpu-features.rst
@@ -0,0 +1,15 @@
+CPU Features
+============
+
+The QEMU emulation of the OpenRISC architecture provides the following
+built-in features.
+
+- Shadow GPRs
+- MMU TLB with 128 entries, 1 way
+- Power Management (PM)
+- Programmable Interrupt Controller (PIC)
+- Tick Timer
+
+These features are on by default and their presence can be confirmed by checking
+the contents of the Unit Presence Register (``UPR``) and CPU Configuration
+Register (``CPUCFGR``).
diff --git a/docs/system/openrisc/emulation.rst b/docs/system/openrisc/emulation.rst
new file mode 100644
index 0000000000..0af898ab20
--- /dev/null
+++ b/docs/system/openrisc/emulation.rst
@@ -0,0 +1,17 @@
+OpenRISC 1000 CPU architecture support
+======================================
+
+QEMU's TCG emulation includes support for the OpenRISC or1200 implementation of
+the OpenRISC 1000 cpu architecture.
+
+The or1200 cpu also has support for the following instruction subsets:
+
+- ORBIS32 (OpenRISC Basic Instruction Set)
+- ORFPX32 (OpenRISC Floating-Point eXtension)
+
+In addition to the instruction subsets the QEMU TCG emulation also has support
+for most Class II (optional) instructions.
+
+For information on all OpenRISC instructions please refer to the latest
+architecture manual available on the OpenRISC website in the
+`OpenRISC Architecture <https://openrisc.io/architecture>`_ section.
diff --git a/docs/system/openrisc/or1k-sim.rst b/docs/system/openrisc/or1k-sim.rst
new file mode 100644
index 0000000000..ef10439737
--- /dev/null
+++ b/docs/system/openrisc/or1k-sim.rst
@@ -0,0 +1,43 @@
+Or1ksim board
+=============
+
+The QEMU Or1ksim machine emulates the standard OpenRISC board simulator which is
+also the standard SoC configuration.
+
+Supported devices
+-----------------
+
+ * 16550A UART
+ * ETHOC Ethernet controller
+ * SMP (OpenRISC multicore using ompic)
+
+Boot options
+------------
+
+The Or1ksim machine can be started using the ``-kernel`` and ``-initrd`` options
+to load a Linux kernel and optional disk image.
+
+.. code-block:: bash
+
+ $ qemu-system-or1k -cpu or1200 -M or1k-sim -nographic \
+ -kernel vmlinux \
+ -initrd initramfs.cpio.gz \
+ -m 128
+
+Linux guest kernel configuration
+""""""""""""""""""""""""""""""""
+
+The 'or1ksim_defconfig' for Linux openrisc kernels includes the right
+drivers for the or1ksim machine. If you would like to run an SMP system
+choose the 'simple_smp_defconfig' config.
+
+Hardware configuration information
+""""""""""""""""""""""""""""""""""
+
+The ``or1k-sim`` board automatically generates a device tree blob ("dtb")
+which it passes to the guest. This provides information about the
+addresses, interrupt lines and other configuration of the various devices
+in the system.
+
+The location of the DTB will be passed in register ``r3`` to the guest operating
+system.
diff --git a/docs/system/openrisc/virt.rst b/docs/system/openrisc/virt.rst
new file mode 100644
index 0000000000..2fe61ac942
--- /dev/null
+++ b/docs/system/openrisc/virt.rst
@@ -0,0 +1,50 @@
+'virt' generic virtual platform
+===============================
+
+The ``virt`` board is a platform which does not correspond to any
+real hardware; it is designed for use in virtual machines.
+It is the recommended board type if you simply want to run
+a guest such as Linux and do not care about reproducing the
+idiosyncrasies and limitations of a particular bit of real-world
+hardware.
+
+Supported devices
+-----------------
+
+ * PCI/PCIe devices
+ * 8 virtio-mmio transport devices
+ * 16550A UART
+ * Goldfish RTC
+ * SiFive Test device for poweroff and reboot
+ * SMP (OpenRISC multicore using ompic)
+
+Boot options
+------------
+
+The virt machine can be started using the ``-kernel`` and ``-initrd`` options
+to load a Linux kernel and optional disk image. For example:
+
+.. code-block:: bash
+
+ $ qemu-system-or1k -cpu or1200 -M virt -nographic \
+ -device virtio-net-device,netdev=user -netdev user,id=user,net=10.9.0.1/24,host=10.9.0.100 \
+ -device virtio-blk-device,drive=d0 -drive file=virt.qcow2,id=d0,if=none,format=qcow2 \
+ -kernel vmlinux \
+ -initrd initramfs.cpio.gz \
+ -m 128
+
+Linux guest kernel configuration
+""""""""""""""""""""""""""""""""
+
+The 'virt_defconfig' for Linux openrisc kernels includes the right drivers for
+the ``virt`` machine.
+
+Hardware configuration information
+""""""""""""""""""""""""""""""""""
+
+The ``virt`` board automatically generates a device tree blob ("dtb") which it
+passes to the guest. This provides information about the addresses, interrupt
+lines and other configuration of the various devices in the system.
+
+The location of the DTB will be passed in register ``r3`` to the guest operating
+system.
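+
+To inspect the generated device tree, one option is the generic
+``dumpdtb`` machine property. The following is only a sketch: it assumes
+that your QEMU build honours ``dumpdtb`` for this board, that a kernel is
+supplied so the device tree is actually generated, and that the ``dtc``
+device tree compiler is installed. ``virt.dtb`` is an arbitrary output
+file name.
+
+.. code-block:: bash
+
+    $ qemu-system-or1k -M virt,dumpdtb=virt.dtb -nographic -kernel vmlinux
+    $ dtc -I dtb -O dts virt.dtb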
diff --git a/docs/system/ppc/amigang.rst b/docs/system/ppc/amigang.rst
new file mode 100644
index 0000000000..e2c9cb74b7
--- /dev/null
+++ b/docs/system/ppc/amigang.rst
@@ -0,0 +1,161 @@
+=========================================================
+AmigaNG boards (``amigaone``, ``pegasos2``, ``sam460ex``)
+=========================================================
+
+These PowerPC machines emulate boards that are primarily used for
+running Amiga like OSes (AmigaOS 4, MorphOS and AROS) but these can
+also run Linux which is what this section documents.
+
+Eyetech AmigaOne/Mai Logic Teron (``amigaone``)
+===============================================
+
+The ``amigaone`` machine emulates an AmigaOne XE mainboard by Eyetech
+which is a rebranded Mai Logic Teron board with modified U-Boot
+firmware to support AmigaOS 4.
+
+Emulated devices
+----------------
+
+ * PowerPC 7457 CPU (can also use ``-cpu g3, 750cxe, 750fx`` or ``750gx``)
+ * Articia S north bridge
+ * VIA VT82C686B south bridge
+ * PCI VGA compatible card (guests may need other card instead)
+ * PS/2 keyboard and mouse
+
+Firmware
+--------
+
+A firmware binary is necessary for the boot process. It is a modified
+U-Boot under GPL but its source is lost so it cannot be included in
+QEMU. A binary is available at
+https://www.hyperion-entertainment.com/index.php/downloads?view=files&parent=28.
+The ROM image is in the last 512kB which can be extracted with the
+following command:
+
+.. code-block:: bash
+
+ $ tail -c 524288 updater.image > u-boot-amigaone.bin
+
+The BIOS emulator in the firmware is unable to run QEMU's standard
+vgabios so ``VGABIOS-lgpl-latest.bin`` is needed instead which can be
+downloaded from http://www.nongnu.org/vgabios.
+
+Running Linux
+-------------
+
+There are some Linux images under the following link that work on the
+``amigaone`` machine:
+https://sourceforge.net/projects/amigaone-linux/files/debian-installer/.
+To boot the system run:
+
+.. code-block:: bash
+
+ $ qemu-system-ppc -machine amigaone -bios u-boot-amigaone.bin \
+ -cdrom "A1 Linux Net Installer.iso" \
+ -device ati-vga,model=rv100,romfile=VGABIOS-lgpl-latest.bin
+
+From the firmware menu that appears select ``Boot sequence`` →
+``Amiga Multiboot Options`` and set ``Boot device 1`` to
+``Onboard VIA IDE CDROM``. Then hit escape until the main screen appears again,
+hit escape once more and from the exit menu that appears select either
+``Save settings and exit`` or ``Use settings for this session only``. It may
+take a long time loading the kernel into memory but eventually it boots and the
+installer becomes visible. The ``ati-vga`` RV100 emulation is not
+complete yet so only the frame buffer works; DRM and 3D are not available.
+
+Genesi/bPlan Pegasos II (``pegasos2``)
+======================================
+
+The ``pegasos2`` machine emulates the Pegasos II sold by Genesi and
+designed by bPlan. Its schematics are available at
+https://www.powerdeveloper.org/platforms/pegasos/schematics.
+
+Emulated devices
+----------------
+
+ * PowerPC 7457 CPU (can also use ``-cpu g3`` or ``750cxe``)
+ * Marvell MV64361 Discovery II north bridge
+ * VIA VT8231 south bridge
+ * PCI VGA compatible card (guests may need other card instead)
+ * PS/2 keyboard and mouse
+
+Firmware
+--------
+
+The Pegasos II board has an Open Firmware compliant ROM based on
+SmartFirmware with some changes that are not open-sourced, therefore
+the ROM binary cannot be included in QEMU. An updater was available
+from bPlan and can be found in the `Internet Archive
+<http://web.archive.org/web/20071021223056/http://www.bplan-gmbh.de/up050404/up050404>`_.
+The ROM image can be extracted from it with the following command:
+
+.. code-block:: bash
+
+ $ tail -c +85581 up050404 | head -c 524288 > pegasos2.rom
+
+Running Linux
+-------------
+
+The PowerPC version of Debian 8.11 supported Pegasos II. The BIOS
+emulator in the firmware binary is unable to run QEMU's standard
+vgabios so it needs to be disabled. To boot the system run:
+
+.. code-block:: bash
+
+ $ qemu-system-ppc -machine pegasos2 -bios pegasos2.rom \
+ -cdrom debian-8.11.0-powerpc-netinst.iso \
+ -device VGA,romfile="" -serial stdio
+
+At the firmware ``ok`` prompt enter ``boot cd install/pegasos``.
+
+Alternatively, it is possible to boot the kernel directly without
+firmware ROM using the QEMU built-in minimal Virtual Open Firmware
+(VOF) emulation which is also supported on ``pegasos2``. For this,
+extract the kernel ``install/powerpc/vmlinuz-chrp.initrd`` from the CD
+image, then run:
+
+.. code-block:: bash
+
+ $ qemu-system-ppc -machine pegasos2 -serial stdio \
+ -kernel vmlinuz-chrp.initrd -append "---" \
+ -cdrom debian-8.11.0-powerpc-netinst.iso
+
+aCube Sam460ex (``sam460ex``)
+=============================
+
+The ``sam460ex`` machine emulates the Sam460ex board by aCube which is
+based on the AMCC PowerPC 460EX SoC (that despite its name has a
+PPC440 CPU core).
+
+Firmware
+--------
+
+The board has a firmware based on an older U-Boot version with
+modifications to support booting AmigaOS 4. The firmware ROM is
+included with QEMU.
+
+Emulated devices
+----------------
+
+ * PowerPC 460EX SoC
+ * M41T80 serial RTC chip
+ * Silicon Motion SM501 display parts (identical to SM502 on real board)
+ * Silicon Image SiI3112 2 port SATA controller
+ * USB keyboard and mouse
+
+Running Linux
+-------------
+
+The only Linux distro that supported Sam460ex out of the box was CruxPPC
+2.x. It can be booted by running:
+
+.. code-block:: bash
+
+ $ qemu-system-ppc -machine sam460ex -serial stdio \
+ -drive if=none,id=cd,format=raw,file=crux-ppc-2.7a.iso \
+ -device ide-cd,drive=cd,bus=ide.1
+
+There are some other kernels and instructions for booting other
+distros on aCube's product page at
+https://www.acube-systems.biz/index.php?page=hardware&pid=5
+but those are untested.
diff --git a/docs/system/ppc/embedded.rst b/docs/system/ppc/embedded.rst
index cfffbda24d..af3b3d9fa4 100644
--- a/docs/system/ppc/embedded.rst
+++ b/docs/system/ppc/embedded.rst
@@ -6,5 +6,4 @@ Embedded family boards
- ``ppce500`` generic paravirt e500 platform
- ``ref405ep`` ref405ep
- ``sam460ex`` aCube Sam460ex
-- ``taihu`` taihu
- ``virtex-ml507`` Xilinx Virtex ML507 reference design
diff --git a/docs/system/ppc/powernv.rst b/docs/system/ppc/powernv.rst
index 4c4cdea527..09f3965858 100644
--- a/docs/system/ppc/powernv.rst
+++ b/docs/system/ppc/powernv.rst
@@ -1,7 +1,7 @@
-PowerNV family boards (``powernv8``, ``powernv9``)
+PowerNV family boards (``powernv8``, ``powernv9``, ``powernv10``)
==================================================================
-PowerNV (as Non-Virtualized) is the "baremetal" platform using the
+PowerNV (as Non-Virtualized) is the "bare metal" platform using the
OPAL firmware. It runs Linux on IBM and OpenPOWER systems and it can
be used as an hypervisor OS, running KVM guests, or simply as a host
OS.
@@ -16,16 +16,14 @@ Supported devices
-----------------
* Multi processor support for POWER8, POWER8NVL and POWER9.
- * XSCOM, serial communication sideband bus to configure chiplets
- * Simple LPC Controller
- * Processor Service Interface (PSI) Controller
- * Interrupt Controller, XICS (POWER8) and XIVE (POWER9)
- * POWER8 PHB3 PCIe Host bridge and POWER9 PHB4 PCIe Host bridge
- * Simple OCC is an on-chip microcontroller used for power management
- tasks
- * iBT device to handle BMC communication, with the internal BMC
- simulator provided by QEMU or an external BMC such as an Aspeed
- QEMU machine.
+ * XSCOM, serial communication sideband bus to configure chiplets.
+ * Simple LPC Controller.
+ * Processor Service Interface (PSI) Controller.
+ * Interrupt Controller, XICS (POWER8), XIVE (POWER9) and XIVE2 (Power10).
+ * POWER8 PHB3 PCIe Host bridge and POWER9 PHB4 PCIe Host bridge.
+ * Simple OCC is an on-chip micro-controller used for power management tasks.
+ * iBT device to handle BMC communication, with the internal BMC simulator
+ provided by QEMU or an external BMC such as an Aspeed QEMU machine.
* PNOR containing the different firmware partitions.
Missing devices
@@ -33,32 +31,42 @@ Missing devices
A lot is missing, among which :
- * POWER10 processor
- * XIVE2 (POWER10) interrupt controller
- * I2C controllers (yet to be merged)
- * NPU/NPU2/NPU3 controllers
- * EEH support for PCIe Host bridge controllers
- * NX controller
- * VAS controller
- * chipTOD (Time Of Day)
+ * I2C controllers (yet to be merged).
+ * NPU/NPU2/NPU3 controllers.
+ * EEH support for PCIe Host bridge controllers.
+ * NX controller.
+ * VAS controller.
+ * chipTOD (Time Of Day).
* Self Boot Engine (SBE).
- * FSI bus
+ * FSI bus.
Firmware
--------
The OPAL firmware (OpenPower Abstraction Layer) for OpenPower systems
includes the runtime services ``skiboot`` and the bootloader kernel and
-initramfs ``skiroot``. Source code can be found on GitHub:
+initramfs ``skiroot``. Source code can be found on the `OpenPOWER account at
+GitHub <https://github.com/open-power>`_.
- https://github.com/open-power.
-
-Prebuilt images of ``skiboot`` and ``skiboot`` are made available on the `OpenPOWER <https://openpower.xyz/job/openpower/job/openpower-op-build/>`__ site. To boot a POWER9 machine, use the `witherspoon <https://openpower.xyz/job/openpower/job/openpower-op-build/label=slave,target=witherspoon/lastSuccessfulBuild/>`__ images. For POWER8, use
-the `palmetto <https://openpower.xyz/job/openpower/job/openpower-op-build/label=slave,target=palmetto/lastSuccessfulBuild/>`__ images.
+Prebuilt images of ``skiboot`` and ``skiroot`` are made available on the
+`OpenPOWER <https://github.com/open-power/op-build/releases/>`__ site.
QEMU includes a prebuilt image of ``skiboot`` which is updated when a
more recent version is required by the models.
+Current acceleration status
+---------------------------
+
+KVM acceleration on Linux Power hosts is provided by the kvm-hv and
+kvm-pr modules. kvm-hv adheres to PAPR and is not compatible with
+powernv. kvm-pr could in theory be used as a valid accel option, but
+kvm-pr does not support this at the moment.
+
+To spare users from dealing with uninformative errors when attempting
+to use accel=kvm, the powernv machine will report an error stating that
+KVM is not supported. This can be revisited in the future if kvm-pr (or
+any other KVM alternative) is usable as KVM accel for this machine.
+
Boot options
------------
@@ -84,6 +92,7 @@ and a SATA disk :
Complex PCIe configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~
+
Six PHBs are defined per chip (POWER9) but no default PCI layout is
provided (to be compatible with libvirt). One PCI device can be added
on any of the available PCIe slots using command line options such as:
@@ -158,7 +167,7 @@ one on the command line :
The files `palmetto-SDR.bin <http://www.kaod.org/qemu/powernv/palmetto-SDR.bin>`__
and `palmetto-FRU.bin <http://www.kaod.org/qemu/powernv/palmetto-FRU.bin>`__
define a Sensor Data Record repository and a Field Replaceable Unit
-inventory for a palmetto BMC. They can be used to extend the QEMU BMC
+inventory for a Palmetto BMC. They can be used to extend the QEMU BMC
simulator.
.. code-block:: bash
@@ -186,8 +195,7 @@ Use a MTD drive to add a PNOR to the machine, and get a NVRAM :
-drive file=./witherspoon.pnor,format=raw,if=mtd
-CAVEATS
--------
+Maintainer contact information
+------------------------------
- * No support for multiple HW threads (SMT=1). Same as pseries.
- * CPU can hang when doing intensive I/Os. Use ``-append powersave=off`` in that case.
+Cédric Le Goater <clg@kaod.org>
diff --git a/docs/system/ppc/ppce500.rst b/docs/system/ppc/ppce500.rst
index afc58f60f5..c9fe0915dc 100644
--- a/docs/system/ppc/ppce500.rst
+++ b/docs/system/ppc/ppce500.rst
@@ -19,6 +19,7 @@ The ``ppce500`` machine supports the following devices:
* Power-off functionality via one GPIO pin
* 1 Freescale MPC8xxx PCI host controller
* VirtIO devices via PCI bus
+* 1 Freescale Enhanced Secure Digital Host controller (eSDHC)
* 1 Freescale Enhanced Triple Speed Ethernet controller (eTSEC)
Hardware configuration information
@@ -75,7 +76,7 @@ as the BIOS. QEMU follows below truth table to select which payload to execute:
When both -bios and -kernel are present, QEMU loads U-Boot and U-Boot in turns
automatically loads the kernel image specified by the -kernel parameter via
U-Boot's built-in "bootm" command, hence a legacy uImage format is required in
-such senario.
+such a scenario.
Running Linux kernel
--------------------
@@ -113,7 +114,7 @@ To boot the 32-bit Linux kernel:
.. code-block:: bash
- $ qemu-system-ppc{64|32} -M ppce500 -cpu e500mc -smp 4 -m 2G \
+ $ qemu-system-ppc64 -M ppce500 -cpu e500mc -smp 4 -m 2G \
-display none -serial stdio \
-kernel vmlinux \
-initrd /path/to/rootfs.cpio \
@@ -146,15 +147,18 @@ You can specify a real world SoC device that QEMU has built-in support but all
these SoCs are e500v2 based MPC85xx series, hence you cannot test anything
built for P4080 (e500mc), P5020 (e5500) and T2080 (e6500).
+Networking
+----------
+
By default a VirtIO standard PCI networking device is connected as an ethernet
interface at PCI address 0.1.0, but we can switch that to an e1000 NIC by:
.. code-block:: bash
- $ qemu-system-ppc -M ppce500 -smp 4 -m 2G \
- -display none -serial stdio \
- -bios u-boot \
- -nic tap,ifname=tap0,script=no,downscript=no,model=e1000
+ $ qemu-system-ppc64 -M ppce500 -smp 4 -m 2G \
+ -display none -serial stdio \
+ -bios u-boot \
+ -nic tap,ifname=tap0,script=no,downscript=no,model=e1000
The QEMU ``ppce500`` machine can also dynamically instantiate an eTSEC device
if “-device eTSEC” is given to QEMU:
@@ -162,3 +166,30 @@ if “-device eTSEC” is given to QEMU:
.. code-block:: bash
-netdev tap,ifname=tap0,script=no,downscript=no,id=net0 -device eTSEC,netdev=net0
+
+Root file system on flash drive
+-------------------------------
+
+Rather than using a root file system on ram disk, it is possible to have it on
+CFI flash. Given an ext2 image whose size must be a power of two, it can be used
+as follows:
+
+.. code-block:: bash
+
+ $ qemu-system-ppc64 -M ppce500 -cpu e500mc -smp 4 -m 2G \
+ -display none -serial stdio \
+ -kernel vmlinux \
+ -drive if=pflash,file=/path/to/rootfs.ext2,format=raw \
+ -append "rootwait root=/dev/mtdblock0"
+
+Alternatively, the root file system can also reside on an emulated SD card
+whose size must again be a power of two:
+
+.. code-block:: bash
+
+ $ qemu-system-ppc64 -M ppce500 -cpu e500mc -smp 4 -m 2G \
+ -display none -serial stdio \
+ -kernel vmlinux \
+ -device sd-card,drive=mydrive \
+ -drive id=mydrive,if=none,file=/path/to/rootfs.ext2,format=raw \
+ -append "rootwait root=/dev/mmcblk0"
diff --git a/docs/system/ppc/pseries.rst b/docs/system/ppc/pseries.rst
index 932d4dd17d..a876d897b6 100644
--- a/docs/system/ppc/pseries.rst
+++ b/docs/system/ppc/pseries.rst
@@ -1,12 +1,298 @@
+===================================
pSeries family boards (``pseries``)
===================================
+The Power machine para-virtualized environment described by the Linux on Power
+Architecture Reference ([LoPAR]_) document is called pSeries. This environment
+is also known as sPAPR, System p guests, or simply Power Linux guests (although
+it is capable of running other operating systems, such as AIX).
+
+Even though pSeries is designed to behave as a guest environment, it is also
+capable of acting as a hypervisor OS, providing, in that role, nested
+virtualization capabilities.
+
Supported devices
------------------
+=================
+
+ * Multi processor support for many Power processor generations: POWER7,
+ POWER7+, POWER8, POWER8NVL, POWER9, and Power10. Support for POWER5+ exists,
+ but its state is unknown.
+ * Interrupt Controller, XICS (POWER8) and XIVE (POWER9 and Power10)
+ * vPHB PCIe Host bridge.
+ * vscsi and vnet devices, compatible with the same devices available on a
+ PowerVM hypervisor with VIOS managing LPARs.
+ * Virtio based devices.
+ * PCIe device pass through.
Missing devices
----------------
+===============
+ * SPICE support.
Firmware
---------
+========
+
+The pSeries platform in QEMU comes with 2 firmwares:
+
+`SLOF <https://github.com/aik/SLOF>`_ (Slimline Open Firmware) is an
+implementation of the `IEEE 1275-1994, Standard for Boot (Initialization
+Configuration) Firmware: Core Requirements and Practices
+<https://standards.ieee.org/standard/1275-1994.html>`_.
+
+SLOF performs bus scanning, PCI resource allocation, provides the client
+interface to boot from block devices and network.
+
+QEMU includes a prebuilt image of SLOF which is updated when a more recent
+version is required.
+
+VOF (Virtual Open Firmware) is a minimalistic firmware to work with
+``-machine pseries,x-vof=on``. When enabled, the firmware acts as a slim
+shim and QEMU implements parts of the IEEE 1275 Open Firmware interface.
+
+VOF does not have device drivers, does not do PCI resource allocation and
+relies on ``-kernel`` used with Linux kernels recent enough (v5.4+)
+to do PCI resource assignment. It is ideal to use with petitboot.
+
+Booting via ``-kernel`` supports the following:
+
++-------------------+-------------------+------------------+
+| kernel | pseries,x-vof=off | pseries,x-vof=on |
++===================+===================+==================+
+| vmlinux BE | ✓ | ✓ |
++-------------------+-------------------+------------------+
+| vmlinux LE | ✓ | ✓ |
++-------------------+-------------------+------------------+
+| zImage.pseries BE | ✓¹ | ✓¹ |
++-------------------+-------------------+------------------+
+| zImage.pseries LE | ✓ | ✓ |
++-------------------+-------------------+------------------+
+
+¹ must set kernel-addr=0
+
+Build directions
+================
+
+.. code-block:: bash
+
+ ./configure --target-list=ppc64-softmmu && make
+
+Running instructions
+====================
+
+You can select the pSeries machine type by running QEMU with the following
+options:
+
+.. code-block:: bash
+
+ qemu-system-ppc64 -M pseries <other QEMU arguments>
+
+sPAPR devices
+=============
+
+The sPAPR specification defines a set of para-virtualized devices, which are
+also supported by the pSeries machine in QEMU and can be instantiated with the
+``-device`` option:
+
+* ``spapr-vlan`` : a virtual network interface.
+* ``spapr-vscsi`` : a virtual SCSI disk interface.
+* ``spapr-rng`` : a pseudo-device for passing random number generator data to the
+ guest (see the `H_RANDOM hypercall feature
+ <https://wiki.qemu.org/Features/HRandomHypercall>`_ for details).
+* ``spapr-vty``: a virtual teletype.
+* ``spapr-pci-host-bridge``: a PCI host bridge.
+* ``tpm-spapr``: a Trusted Platform Module (TPM).
+* ``spapr-tpm-proxy``: a TPM proxy.
+
+These are compatible with the devices historically available for use when
+running the IBM PowerVM hypervisor with LPARs.
+
+However, since these devices have originally been specified with another
+hypervisor and non-Linux guests in mind, you should use the virtio counterparts
+(virtio-net, virtio-blk/scsi and virtio-rng for instance) if possible instead,
+since they will most probably give you better performance with Linux guests in a
+QEMU environment.
+
+The pSeries machine in QEMU is always instantiated with the following devices:
+
+* A NVRAM device (``spapr-nvram``).
+* A virtual teletype (``spapr-vty``).
+* A PCI host bridge (``spapr-pci-host-bridge``).
+
+Hence, there is no need to add them manually, unless you use the ``-nodefaults``
+command line option in QEMU.
+
+In the case of the default ``spapr-nvram`` device, if you want to make the
+contents of the NVRAM device persistent, you will need to specify a PFLASH
+device when starting QEMU, i.e. either use
+``-drive if=pflash,file=<filename>,format=raw`` to set the default PFLASH
+device, or specify one with an ID
+(``-drive if=none,file=<filename>,format=raw,id=pfid``) and pass that ID to the
+NVRAM device with ``-global spapr-nvram.drive=pfid``.
+
+sPAPR specification
+-------------------
+
+The main source of documentation on the sPAPR standard is the [LoPAR]_ document.
+However, documentation specific to QEMU's implementation of the specification
+can also be found in QEMU documentation:
+
+.. toctree::
+ :maxdepth: 1
+
+ ../../specs/ppc-spapr-hotplug.rst
+ ../../specs/ppc-spapr-hcalls.rst
+ ../../specs/ppc-spapr-numa.rst
+ ../../specs/ppc-spapr-uv-hcalls.rst
+ ../../specs/ppc-spapr-xive.rst
+
+Switching between the KVM-PR and KVM-HV kernel module
+=====================================================
+
+Currently, there are two implementations of KVM on Power, ``kvm_hv.ko`` and
+``kvm_pr.ko``.
+
+
+If a host supports both KVM modes, and both KVM kernel modules are loaded, it is
+possible to switch between the two modes with the ``kvm-type`` parameter:
+
+* Use ``qemu-system-ppc64 -M pseries,accel=kvm,kvm-type=PR`` to use the
+ ``kvm_pr.ko`` kernel module.
+* Use ``qemu-system-ppc64 -M pseries,accel=kvm,kvm-type=HV`` to use ``kvm_hv.ko``
+ instead.
+
+KVM-PR
+------
+
+KVM-PR uses the so-called **PR**\ oblem state of the PPC CPUs to run the guests,
+i.e. the virtual machine is run in user mode and all privileged instructions
+trap and have to be emulated by the host. That means you can run KVM-PR inside
+a pSeries guest (or a PowerVM LPAR for that matter), and that is where it has
+originated, as historically (prior to POWER7) it was not possible to run Linux
+in hypervisor mode on a Power processor (this function was restricted to
+PowerVM, the IBM proprietary hypervisor).
+
+Because all privileged instructions are trapped, guests that use a lot of
+privileged instructions run quite slowly with KVM-PR. On the other hand, because
+of that, this kernel module can run on pretty much any PPC hardware, and is
+able to emulate a lot of guest CPUs. This module can even be used to run other
+PowerPC guests like an emulated PowerMac.
+
+As KVM-PR can be run inside a pSeries guest, it can also provide nested
+virtualization capabilities (i.e. running a guest from within a guest).
+
+It is important to notice that, as KVM-HV provides a much better execution
+performance, maintenance work has been much more focused on it in the past
+years. Maintenance for KVM-PR has been minimal.
+
+In order to run KVM-PR guests with POWER9 processors, you will need to start
+QEMU with the ``kernel_irqchip=off`` command line option.
+
+KVM-HV
+------
+
+KVM-HV uses the hypervisor mode of more recent Power processors, which allows
+access to the bare metal hardware directly. Although POWER7 had this capability,
+it was only starting with POWER8 that this was officially supported by IBM.
+
+Originally, KVM-HV was only available when running on a PowerNV platform (a.k.a.
+Power bare metal). Although it runs on a PowerNV platform, it can only be used
+to start pSeries guests. As the pSeries guest doesn't have access to the
+hypervisor mode of the Power CPU, it wasn't possible to run KVM-HV on a guest.
+This limitation has been lifted, and now it is possible to run KVM-HV inside
+pSeries guests as well, making nested virtualization possible with KVM-HV.
+
+As KVM-HV has access to privileged instructions, guests that use a lot of these
+can run much faster than with KVM-PR. On the other hand, the guest CPU has to be
+of the same type as the host CPU this way, e.g. it is not possible to specify an
+embedded PPC CPU for the guest with KVM-HV. However, there is at least the
+possibility to run the guest in a backward-compatibility mode of previous
+CPU generations, e.g. you can run a POWER7 guest on a POWER8 host by using
+``-cpu POWER8,compat=power7`` as parameter to QEMU.
+
+Modules support
+===============
+
+As noted in the sections above, each module can run in a different
+environment. The following table shows with which environment each module can
+run. As long as you are in a supported environment, you can run KVM-PR or KVM-HV
+nested. Combinations not shown in the table are not available.
+
++--------------+------------+------+-------------------+----------+--------+
+| Platform | Host type | Bits | Page table format | KVM-HV | KVM-PR |
++==============+============+======+===================+==========+========+
+| PowerNV | bare metal | 32 | hash | no | yes |
+| | | +-------------------+----------+--------+
+| | | | radix | N/A | N/A |
+| | +------+-------------------+----------+--------+
+| | | 64 | hash | yes | yes |
+| | | +-------------------+----------+--------+
+| | | | radix | yes | no |
++--------------+------------+------+-------------------+----------+--------+
+| pSeries [1]_ | PowerNV | 32 | hash | no | yes |
+| | | +-------------------+----------+--------+
+| | | | radix | N/A | N/A |
+| | +------+-------------------+----------+--------+
+| | | 64 | hash | no | yes |
+| | | +-------------------+----------+--------+
+| | | | radix | yes [2]_ | no |
+| +------------+------+-------------------+----------+--------+
+| | PowerVM | 32 | hash | no | yes |
+| | | +-------------------+----------+--------+
+| | | | radix | N/A | N/A |
+| | +------+-------------------+----------+--------+
+| | | 64 | hash | no | yes |
+| | | +-------------------+----------+--------+
+| | | | radix [3]_ | no | yes |
++--------------+------------+------+-------------------+----------+--------+
+
+.. [1] On POWER9 DD2.1 processors, the page table format on the host and guest
+ must be the same.
+
+.. [2] KVM-HV cannot run nested on POWER8 machines.
+
+.. [3] Introduced on Power10 machines.
+
+
+.. _power-papr-protected-execution-facility-pef:
+
+POWER (PAPR) Protected Execution Facility (PEF)
+-----------------------------------------------
+
+Protected Execution Facility (PEF), also known as Secure Guest support
+is a feature found on IBM POWER9 and POWER10 processors.
+
+If a suitable firmware including an Ultravisor is installed, it adds
+an extra memory protection mode to the CPU. The ultravisor manages a
+pool of secure memory which cannot be accessed by the hypervisor.
+
+When this feature is enabled in QEMU, a guest can use ultracalls to
+enter "secure mode". This transfers most of its memory to secure
+memory, where it cannot be eavesdropped by a compromised hypervisor.
+
+Launching
+^^^^^^^^^
+
+To launch a guest which will be permitted to enter PEF secure mode::
+
+ $ qemu-system-ppc64 \
+ -object pef-guest,id=pef0 \
+ -machine confidential-guest-support=pef0 \
+ ...
+
+Live Migration
+^^^^^^^^^^^^^^
+
+Live migration is not yet implemented for PEF guests. For
+consistency, QEMU currently prevents migration if the PEF feature is
+enabled, whether or not the guest has actually entered secure mode.
+
+
+Maintainer contact information
+==============================
+
+Cédric Le Goater <clg@kaod.org>
+
+Daniel Henrique Barboza <danielhb413@gmail.com>
+
+.. [LoPAR] `Linux on Power Architecture Reference document (LoPAR) revision
+ 2.9 <https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200812.pdf>`_.
diff --git a/docs/system/qemu-block-drivers.rst.inc b/docs/system/qemu-block-drivers.rst.inc
index 16225710eb..384e95ba76 100644
--- a/docs/system/qemu-block-drivers.rst.inc
+++ b/docs/system/qemu-block-drivers.rst.inc
@@ -430,6 +430,12 @@ Hard disks
you may corrupt your host data (use the ``-snapshot`` command
line option or modify the device permissions accordingly).
+Zoned block devices
+ Zoned block devices can be passed through to the guest if the emulated storage
+ controller supports zoned storage. Use ``--blockdev host_device,
+ node-name=drive0,filename=/dev/nullb0,cache.direct=on`` to pass through
+ ``/dev/nullb0`` as ``drive0``.
+
Windows
^^^^^^^
@@ -511,13 +517,13 @@ of an inet socket:
|qemu_system| linux.img -hdb nbd+unix://?socket=/tmp/my_socket
-In this case, the block device must be exported using qemu-nbd:
+In this case, the block device must be exported using ``qemu-nbd``:
.. parsed-literal::
qemu-nbd --socket=/tmp/my_socket my_disk.qcow2
-The use of qemu-nbd allows sharing of a disk between several guests:
+The use of ``qemu-nbd`` allows sharing of a disk between several guests:
.. parsed-literal::
@@ -530,7 +536,7 @@ and then you can use it with two guests:
|qemu_system| linux1.img -hdb nbd+unix://?socket=/tmp/my_socket
|qemu_system| linux2.img -hdb nbd+unix://?socket=/tmp/my_socket
-If the nbd-server uses named exports (supported since NBD 2.9.18, or with QEMU's
+If the ``nbd-server`` uses named exports (supported since NBD 2.9.18, or with QEMU's
own embedded NBD server), you must specify an export name in the URI:
.. parsed-literal::
@@ -731,7 +737,6 @@ Examples
|qemu_system| -drive file=gluster+tcp://[1:2:3:4:5:6:7:8]:24007/testvol/dir/a.img
|qemu_system| -drive file=gluster+tcp://server.domain.com:24007/testvol/dir/a.img
|qemu_system| -drive file=gluster+unix:///testvol/dir/a.img?socket=/tmp/glusterd.socket
- |qemu_system| -drive file=gluster+rdma://1.2.3.4:24007/testvol/a.img
|qemu_system| -drive file=gluster://1.2.3.4/testvol/a.img,file.debug=9,file.logfile=/var/log/qemu-gluster.log
|qemu_system| 'json:{"driver":"qcow2",
"file":{"driver":"gluster",
@@ -778,10 +783,32 @@ The optional *HOST_KEY_CHECK* parameter controls how the remote
host's key is checked. The default is ``yes`` which means to use
the local ``.ssh/known_hosts`` file. Setting this to ``no``
turns off known-hosts checking. Or you can check that the host key
-matches a specific fingerprint:
-``host_key_check=md5:78:45:8e:14:57:4f:d5:45:83:0a:0e:f3:49:82:c9:c8``
-(``sha1:`` can also be used as a prefix, but note that OpenSSH
-tools only use MD5 to print fingerprints).
+matches a specific fingerprint. The fingerprint can be provided in
+``md5``, ``sha1``, or ``sha256`` format, however, it is strongly
+recommended to only use ``sha256``, since the other options are
+considered insecure by modern standards. The fingerprint value
+must be given as a hex encoded string::
+
+ host_key_check=sha256:04ce2ae89ff4295a6b9c4111640bdcb3297858ee55cb434d9dd88796e93aa795
+
+The key string may optionally contain ":" separators between
+each pair of hex digits.
+
+The ``$HOME/.ssh/known_hosts`` file contains the base64 encoded
+host keys. These can be converted into the format needed for
+QEMU using a command such as::
+
+ $ for key in `grep 10.33.8.112 known_hosts | awk '{print $3}'`
+ do
+ echo $key | base64 -d | sha256sum
+ done
+ 6c3aa525beda9dc83eadfbd7e5ba7d976ecb59575d1633c87cd06ed2ed6e366f -
+ 12214fd9ea5b408086f98ecccd9958609bd9ac7c0ea316734006bc7818b45dc8 -
+ d36420137bcbd101209ef70c3b15dc07362fbe0fa53c5b135eba6e6afa82f0ce -
+
+Note that there can be multiple keys present per host, each with
+different key ciphers. Care is needed to pick the key fingerprint
+that matches the cipher QEMU will negotiate with the remote server.
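
The conversion described above can be wrapped in a small helper. This is only a sketch: the function names ``ssh_key_sha256`` and ``host_key_check_options`` are hypothetical, and it assumes a ``known_hosts`` file with un-hashed host names and the GNU ``base64`` and ``sha256sum`` tools.

```shell
# Hypothetical helper: turn one base64-encoded host key (the third field of
# a known_hosts line) into the hex SHA-256 digest used by host_key_check.
ssh_key_sha256() {
    printf '%s' "$1" | base64 -d | sha256sum | awk '{print $1}'
}

# Print one candidate option per key recorded for a host.
# $1 is the host name or address, $2 is the known_hosts file to search.
# Assumes host names are stored un-hashed in that file.
host_key_check_options() {
    grep -F -- "$1" "$2" | awk '{print $3}' | while read -r key; do
        echo "host_key_check=sha256:$(ssh_key_sha256 "$key")"
    done
}
```

For example, ``host_key_check_options 10.33.8.112 ~/.ssh/known_hosts`` prints one line per stored key; you still have to pick the line that matches the key type negotiated with the server.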
Currently authentication must be done using ssh-agent. Other
authentication methods may be supported in future.
diff --git a/docs/system/qemu-manpage.rst b/docs/system/qemu-manpage.rst
index c47a412758..3ade4ee45b 100644
--- a/docs/system/qemu-manpage.rst
+++ b/docs/system/qemu-manpage.rst
@@ -31,6 +31,11 @@ Options
disk_image is a raw hard disk image for IDE hard disk 0. Some targets do
not need a disk image.
+When an option parameter is an arbitrary string containing commas,
+such as in "file=my,file" and "string=a,b", it's necessary to
+double the commas. For instance, "-fw_cfg name=z,string=a,,b" will be
+parsed as "-fw_cfg name=z,string=a,b".
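
When the value comes from a script variable, the doubling can be done mechanically. The following is a minimal sketch; ``qemu_escape_commas`` is a hypothetical helper name and not part of QEMU.

```shell
# Hypothetical helper: double every comma so that QEMU's option parser
# treats it as a literal character rather than as a separator.
qemu_escape_commas() {
    printf '%s' "$1" | sed 's/,/,,/g'
}

# Build a -fw_cfg argument whose string value is "a,b".
fw_cfg_arg="name=z,string=$(qemu_escape_commas 'a,b')"
echo "-fw_cfg $fw_cfg_arg"
```

The same helper applies to file names, e.g. ``file=$(qemu_escape_commas 'my,file')``.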
+
.. hxtool-doc:: qemu-options.hx
.. include:: keys.rst.inc
diff --git a/docs/system/quickstart.rst b/docs/system/quickstart.rst
deleted file mode 100644
index 681678c86e..0000000000
--- a/docs/system/quickstart.rst
+++ /dev/null
@@ -1,21 +0,0 @@
-.. _pcsys_005fquickstart:
-
-Quick Start
------------
-
-Download and uncompress a PC hard disk image with Linux installed (e.g.
-``linux.img``) and type:
-
-.. parsed-literal::
-
- |qemu_system| linux.img
-
-Linux should boot and give you a prompt.
-
-Users should be aware the above example elides a lot of the complexity
-of setting up a VM with x86_64 specific defaults and assumes the
-first non switch argument is a PC compatible disk image with a boot
-sector. For a non-x86 system where we emulate a broad range of machine
-types, the command lines are generally more explicit in defining the
-machine and boot behaviour. You will find more example command lines
-in the :ref:`system-targets-ref` section of the manual.
diff --git a/docs/system/replay.rst b/docs/system/replay.rst
new file mode 100644
index 0000000000..28e5772a2b
--- /dev/null
+++ b/docs/system/replay.rst
@@ -0,0 +1,237 @@
+.. _replay:
+
+..
+ Copyright (c) 2010-2022 Institute for System Programming
+ of the Russian Academy of Sciences.
+
+ This work is licensed under the terms of the GNU GPL, version 2 or later.
+ See the COPYING file in the top-level directory.
+
+Record/replay
+=============
+
+Record/replay functions are used for the deterministic replay of QEMU execution.
+Execution recording writes a log of non-deterministic events, which can later be
+used for replaying the execution anywhere and an unlimited number of times.
+It also supports checkpointing for faster rewind to a specific replay moment.
+Execution replaying reads the log and replays all non-deterministic events
+including external input, hardware clocks, and interrupts.
+
+Deterministic replay has the following features:
+
+ * Deterministically replays whole system execution and all contents of
+ the memory, state of the hardware devices, clocks, and screen of the VM.
+ * Writes the execution log into a file for later replaying multiple times
+ on different machines.
+ * Supports i386, x86_64, ARM, AArch64, RISC-V, MIPS, MIPS64, S390X, Alpha,
+ PowerPC, PowerPC64, M68000, Microblaze, OpenRISC, SPARC,
+ and Xtensa hardware platforms.
+ * Performs deterministic replay of all operations with keyboard and mouse
+ input devices, serial ports, and network.
+
+Usage of the record/replay:
+
+ * First, record the execution with the following command line:
+
+ .. parsed-literal::
+ |qemu_system| \\
+ -icount shift=auto,rr=record,rrfile=replay.bin \\
+ -drive file=disk.qcow2,if=none,snapshot,id=img-direct \\
+ -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \\
+ -device ide-hd,drive=img-blkreplay \\
+ -netdev user,id=net1 -device rtl8139,netdev=net1 \\
+ -object filter-replay,id=replay,netdev=net1
+
+ * After recording, you can replay it by using another command line:
+
+ .. parsed-literal::
+ |qemu_system| \\
+ -icount shift=auto,rr=replay,rrfile=replay.bin \\
+ -drive file=disk.qcow2,if=none,snapshot,id=img-direct \\
+ -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay \\
+ -device ide-hd,drive=img-blkreplay \\
+ -netdev user,id=net1 -device rtl8139,netdev=net1 \\
+ -object filter-replay,id=replay,netdev=net1
+
+ The only difference with recording is changing the rr option
+ from record to replay.
+ * Block device images are not actually changed in the recording mode,
+ because all of the changes are written to the temporary overlay file.
+ This behavior is enabled by using blkreplay driver. It should be used
+ for every enabled block device, as described in :ref:`block-label` section.
+ * ``-net none`` option should be specified when network is not used,
+ because QEMU adds network card by default. When network is needed,
+ it should be configured explicitly with replay filter, as described
+ in :ref:`network-label` section.
+ * Interactions with audio devices and serial ports are recorded and replayed
+ automatically when such devices are enabled.
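
Because the record and replay command lines differ only in the ``rr`` value, they can be generated from one place. This is a sketch under stated assumptions: ``rr_args`` is a hypothetical helper, the device and backend choices are copied from the example above, and file names must not contain spaces or commas.

```shell
# Hypothetical helper: print the QEMU arguments for one record/replay mode.
# $1 is "record" or "replay", $2 is the log file, $3 is the disk image.
rr_args() {
    case "$1" in
        record|replay) ;;
        *) echo "rr_args: mode must be record or replay" >&2; return 1 ;;
    esac
    # Everything except the rr= value is identical in both modes.
    printf '%s ' \
        -icount "shift=auto,rr=$1,rrfile=$2" \
        -drive "file=$3,if=none,snapshot,id=img-direct" \
        -drive "driver=blkreplay,if=none,image=img-direct,id=img-blkreplay" \
        -device "ide-hd,drive=img-blkreplay" \
        -netdev "user,id=net1" -device "rtl8139,netdev=net1" \
        -object "filter-replay,id=replay,netdev=net1"
    echo
}
```

A run could then look like ``qemu-system-x86_64 $(rr_args record replay.bin disk.qcow2)`` followed by the same command with ``replay``; the binary name here is only an example.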
+
+Core idea
+---------
+
+Record/replay system is based on saving and replaying non-deterministic
+events (e.g. keyboard input) and simulating deterministic ones (e.g. reading
+from HDD or memory of the VM). Saving only non-deterministic events makes
+log file smaller and simulation faster.
+
+The following non-deterministic data from peripheral devices is saved into
+the log: mouse and keyboard input, network packets, audio controller input,
+serial port input, and hardware clocks (they are non-deterministic
+too, because their values are taken from the host machine). Inputs from
+simulated hardware, memory of VM, software interrupts, and execution of
+instructions are not saved into the log, because they are deterministic and
+can be replayed by simulating the behavior of virtual machine starting from
+initial state.
+
+Instruction counting
+--------------------
+
+QEMU should work in icount mode to use record/replay feature. icount was
+designed to allow deterministic execution in absence of external inputs
+of the virtual machine. Record/replay feature is enabled through ``-icount``
+command-line option, making possible deterministic execution of the machine,
+interacting with user or network.
+
+.. _block-label:
+
+Block devices
+-------------
+
+The block device record/replay module intercepts calls of
+bdrv coroutine functions at the top of the block driver stack.
+To record and replay block operations the drive must be configured
+as follows:
+
+.. parsed-literal::
+ -drive file=disk.qcow2,if=none,snapshot,id=img-direct
+ -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
+ -device ide-hd,drive=img-blkreplay
+
+The blkreplay driver should be inserted between the disk image and the virtual
+drive controller, so that all disk requests can be recorded and replayed.
+
+.. _snapshotting-label:
+
+Snapshotting
+------------
+
+New VM snapshots may be created in replay mode. They can be used later
+to recover the desired VM state. All VM states created in replay mode
+are associated with the moment of time in the replay scenario.
+After recovering the VM state replay will start from that position.
+
+Default starting snapshot name may be specified with icount field
+rrsnapshot as follows:
+
+.. parsed-literal::
+ -icount shift=auto,rr=record,rrfile=replay.bin,rrsnapshot=snapshot_name
+
+This snapshot is created at start of recording and restored at start
+of replaying. It also can be loaded while replaying to roll back
+the execution.
+
+``snapshot`` flag of the disk image must be removed to save the snapshots
+in the overlay (or original image) instead of using the temporary overlay.
+
+.. parsed-literal::
+ -drive file=disk.ovl,if=none,id=img-direct
+ -drive driver=blkreplay,if=none,image=img-direct,id=img-blkreplay
+ -device ide-hd,drive=img-blkreplay
+
+Use the QEMU monitor to create additional snapshots. The ``savevm <name>`` command
+creates the snapshot and ``loadvm <name>`` restores it. To prevent corruption
+of the original disk image, use overlay files linked to the original images.
+Therefore all new snapshots (including the starting one) will be saved in
+overlays and the original image remains unchanged.
+
+When you need to use snapshots with a diskless virtual machine,
+it must be started with an "orphan" qcow2 image. This image will be used
+for storing VM snapshots. Here is an example of the command line for this:
+
+.. parsed-literal::
+ |qemu_system| \\
+ -icount shift=auto,rr=replay,rrfile=record.bin,rrsnapshot=init \\
+ -net none -drive file=empty.qcow2,if=none,id=rr
+
+The ``empty.qcow2`` drive is not connected to any virtual block device and is
+used for VM snapshots only.
+
+.. _network-label:
+
+Network devices
+---------------
+
+Record and replay for network interactions is performed with the network filter.
+Each backend must have its own instance of the replay filter as follows:
+
+.. parsed-literal::
+ -netdev user,id=net1 -device rtl8139,netdev=net1
+ -object filter-replay,id=replay,netdev=net1
+
+Replay network filter is used to record and replay network packets. While
+recording the virtual machine this filter puts all packets coming from
+the outer world into the log. In replay mode packets from the log are
+injected into the network device. All interactions with network backend
+in replay mode are disabled.
+
+Audio devices
+-------------
+
+Audio data is recorded and replayed automatically. The command line for recording
+and replaying must contain identical specifications of audio hardware, e.g.:
+
+.. parsed-literal::
+ -audio pa,model=ac97
+
+Serial ports
+------------
+
+Serial port input is recorded and replayed automatically. The command lines
+for recording and replaying must contain an identical number of ports,
+but their backends may differ.
+E.g., ``-serial stdio`` in record mode, and ``-serial null`` in replay mode.
+
+Reverse debugging
+-----------------
+
+Reverse debugging allows "executing" the program in reverse direction.
+GDB remote protocol supports "reverse step" and "reverse continue"
+commands. The first one steps single instruction backwards in time,
+and the second one finds the last breakpoint in the past.
+
+Recorded executions may be used to enable reverse debugging. QEMU can't
+execute the code in backwards direction, but can load a snapshot and
+replay forward to find the desired position or breakpoint.
+
+The following GDB commands are supported:
+
+ - ``reverse-stepi`` (or ``rsi``) - step one instruction backwards
+ - ``reverse-continue`` (or ``rc``) - find last breakpoint in the past
+
+Reverse step loads the nearest snapshot and replays the execution until
+the required instruction is met.
+
+Reverse continue may include several passes of examining the execution
+between the snapshots. Each of the passes includes the following steps:
+
+ #. loading the snapshot
+ #. replaying to examine the breakpoints
+ #. if breakpoint or watchpoint was met
+
+ * loading the snapshot again
+ * replaying to the required breakpoint
+
+ #. else
+
+ * proceeding to step 1 with the earlier snapshot
+
+Therefore reverse debugging requires at least one snapshot to have been
+created. This can be done by omitting the ``snapshot`` option
+for the block drives and adding ``rrsnapshot`` to both the record and replay
+command lines.
+See the :ref:`snapshotting-label` section to learn more about running record/replay
+and creating the snapshot in these modes.
+
+When ``rrsnapshot`` is not used, a snapshot named ``start_debugging`` is
+created in the temporary overlay. This allows using reverse debugging, but with
+temporary snapshots (existing only within the session).
diff --git a/docs/system/riscv/shakti-c.rst b/docs/system/riscv/shakti-c.rst
index a6035d42b0..fea57f7b6b 100644
--- a/docs/system/riscv/shakti-c.rst
+++ b/docs/system/riscv/shakti-c.rst
@@ -45,7 +45,7 @@ Shakti SDK can be used to generate the baremetal example UART applications.
Binary would be generated in:
software/examples/uart_applns/loopback/output/loopback.shakti
-You could also download the precompiled example applicatons using below
+You could also download the precompiled example applications using below
commands.
.. code-block:: bash
diff --git a/docs/system/riscv/sifive_u.rst b/docs/system/riscv/sifive_u.rst
index 7b166567f9..8f55ae8e31 100644
--- a/docs/system/riscv/sifive_u.rst
+++ b/docs/system/riscv/sifive_u.rst
@@ -210,7 +210,7 @@ command line options with ``qemu-system-riscv32``.
Running U-Boot
--------------
-U-Boot mainline v2021.07 release is tested at the time of writing. To build a
+U-Boot mainline v2024.01 release is tested at the time of writing. To build a
U-Boot mainline bootloader that can be booted by the ``sifive_u`` machine, use
the sifive_unleashed_defconfig with similar commands as described above for
Linux:
@@ -325,15 +325,10 @@ configuration of U-Boot:
$ export CROSS_COMPILE=riscv64-linux-
$ make sifive_unleashed_defconfig
- $ make menuconfig
-
-then manually select the following configuration:
-
- * Device Tree Control ---> Provider of DTB for DT Control ---> Prior Stage bootloader DTB
-
-and unselect the following configuration:
-
- * Library routines ---> Allow access to binman information in the device tree
+ $ ./scripts/config --enable OF_BOARD
+ $ ./scripts/config --disable BINMAN_FDT
+ $ ./scripts/config --disable SPL
+ $ make olddefconfig
This changes U-Boot to use the QEMU generated device tree blob, and bypass
running the U-Boot SPL stage.
@@ -352,17 +347,13 @@ It's possible to create a 32-bit U-Boot S-mode image as well.
$ export CROSS_COMPILE=riscv64-linux-
$ make sifive_unleashed_defconfig
- $ make menuconfig
-
-then manually update the following configuration in U-Boot:
-
- * Device Tree Control ---> Provider of DTB for DT Control ---> Prior Stage bootloader DTB
- * RISC-V architecture ---> Base ISA ---> RV32I
- * Boot options ---> Boot images ---> Text Base ---> 0x80400000
-
-and unselect the following configuration:
-
- * Library routines ---> Allow access to binman information in the device tree
+ $ ./scripts/config --disable ARCH_RV64I
+ $ ./scripts/config --enable ARCH_RV32I
+ $ ./scripts/config --set-val TEXT_BASE 0x80400000
+ $ ./scripts/config --enable OF_BOARD
+ $ ./scripts/config --disable BINMAN_FDT
+ $ ./scripts/config --disable SPL
+ $ make olddefconfig
Use the same command line options to boot the 32-bit U-Boot S-mode image:
diff --git a/docs/system/riscv/virt.rst b/docs/system/riscv/virt.rst
index fa016584bf..9a06f95a34 100644
--- a/docs/system/riscv/virt.rst
+++ b/docs/system/riscv/virt.rst
@@ -12,7 +12,7 @@ Supported devices
The ``virt`` machine supports the following devices:
-* Up to 8 generic RV32GC/RV64GC cores, with optional extensions
+* Up to 512 generic RV32GC/RV64GC cores, with optional extensions
* Core Local Interruptor (CLINT)
* Platform-Level Interrupt Controller (PLIC)
* CFI parallel NOR flash memory
@@ -23,9 +23,9 @@ The ``virt`` machine supports the following devices:
* 1 generic PCIe host bridge
* The fw_cfg device that allows a guest to obtain data from QEMU
-Note that the default CPU is a generic RV32GC/RV64GC. Optional extensions
-can be enabled via command line parameters, e.g.: ``-cpu rv64,x-h=true``
-enables the hypervisor extension for RV64.
+The hypervisor extension is enabled for the default CPU, so virtual
+machines can use the hypervisor extension without explicitly
+declaring it.
Hardware configuration information
----------------------------------
@@ -53,6 +53,37 @@ with the default OpenSBI firmware image as the -bios. It also supports
the recommended RISC-V bootflow: U-Boot SPL (M-mode) loads OpenSBI fw_dynamic
firmware and U-Boot proper (S-mode), using the standard -bios functionality.
+Using flash devices
+-------------------
+
+By default, the first flash device (pflash0) is expected to contain
+S-mode firmware code. It can be configured as read-only, with the
+second flash device (pflash1) available to store configuration data.
+
+For example, booting edk2 looks like
+
+.. code-block:: bash
+
+ $ qemu-system-riscv64 \
+ -blockdev node-name=pflash0,driver=file,read-only=on,filename=<edk2_code> \
+ -blockdev node-name=pflash1,driver=file,filename=<edk2_vars> \
+ -M virt,pflash0=pflash0,pflash1=pflash1 \
+ ... other args ....
+
+For TCG guests only, it is also possible to boot M-mode firmware from
+the first flash device (pflash0) by additionally passing ``-bios
+none``, as in
+
+.. code-block:: bash
+
+ $ qemu-system-riscv64 \
+ -bios none \
+ -blockdev node-name=pflash0,driver=file,read-only=on,filename=<m_mode_code> \
+ -M virt,pflash0=pflash0 \
+ ... other args ....
+
+Firmware images used for pflash must be exactly 32 MiB in size.
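
A firmware image that is smaller than 32 MiB can be padded before use. The following is a sketch only: ``pad_pflash_image`` is a hypothetical helper, and it assumes the ``truncate`` tool from GNU coreutils and that zero padding at the end of the image is acceptable for the firmware in question.

```shell
# Hypothetical helper: pad a firmware image to exactly 32 MiB for pflash.
# Refuses images that are already larger, because truncating would cut data.
PFLASH_SIZE=33554432   # 32 * 1024 * 1024 bytes

pad_pflash_image() {
    [ -f "$1" ] || { echo "pad_pflash_image: $1 not found" >&2; return 1; }
    size=$(wc -c < "$1" | tr -d ' ')
    if [ "$size" -gt "$PFLASH_SIZE" ]; then
        echo "pad_pflash_image: $1 is larger than 32 MiB" >&2
        return 1
    fi
    # Extends the file with zero bytes up to the required size.
    truncate -s "$PFLASH_SIZE" "$1"
}
```

The helper modifies the file in place, so run it on a copy if the original image must stay unchanged.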
+
Machine-specific options
------------------------
@@ -62,6 +93,28 @@ The following machine-specific options are supported:
When this option is "on", ACLINT devices will be emulated instead of
SiFive CLINT. When not specified, this option is assumed to be "off".
+ This option is restricted to the TCG accelerator.
+
+- acpi=[on|off|auto]
+
+ When this option is "on" (which is the default), ACPI tables are generated and
+ exposed as firmware tables etc/acpi/rsdp and etc/acpi/tables.
+
+- aia=[none|aplic|aplic-imsic]
+
+ This option allows selecting interrupt controller defined by the AIA
+ (advanced interrupt architecture) specification. The "aia=aplic" selects
+ APLIC (advanced platform level interrupt controller) to handle wired
+ interrupts whereas the "aia=aplic-imsic" selects APLIC and IMSIC (incoming
+ message signaled interrupt controller) to handle both wired interrupts and
+ MSIs. When not specified, this option is assumed to be "none" which selects
+ SiFive PLIC to handle wired interrupts.
+
+- aia-guests=nnn
+
+ The number of per-HART VS-level AIA IMSIC pages to be emulated for a guest
+ having AIA IMSIC (i.e. "aia=aplic-imsic" selected). When not specified,
+ the default number of per-HART VS-level AIA IMSIC pages is 0.
Running Linux kernel
--------------------
@@ -146,3 +199,28 @@ The minimal QEMU commands to run U-Boot SPL are:
To test 32-bit U-Boot images, switch to use qemu-riscv32_smode_defconfig and
riscv32_spl_defconfig builds, and replace ``qemu-system-riscv64`` with
``qemu-system-riscv32`` in the command lines above to boot the 32-bit U-Boot.
+
+Enabling TPM
+------------
+
+A TPM device can be connected to the virt board by following the steps below.
+
+First launch the TPM emulator:
+
+.. code-block:: bash
+
+ $ swtpm socket --tpm2 -t -d --tpmstate dir=/tmp/tpm \
+ --ctrl type=unixio,path=swtpm-sock
+
+Then launch QEMU with some additional arguments to link a TPM device to the backend:
+
+.. code-block:: bash
+
+ $ qemu-system-riscv64 \
+ ... other args .... \
+ -chardev socket,id=chrtpm,path=swtpm-sock \
+ -tpmdev emulator,id=tpm0,chardev=chrtpm \
+ -device tpm-tis-device,tpmdev=tpm0
+
+The TPM device can be seen in the memory tree and the generated device
+tree and should be accessible from the guest software.
diff --git a/docs/system/s390x/bootdevices.rst b/docs/system/s390x/bootdevices.rst
index 9e591cb9dc..1a7a18b43b 100644
--- a/docs/system/s390x/bootdevices.rst
+++ b/docs/system/s390x/bootdevices.rst
@@ -53,6 +53,32 @@ recommended to specify a CD-ROM device via ``-device scsi-cd`` (as mentioned
above) instead.
+Selecting kernels with the ``loadparm`` property
+------------------------------------------------
+
+The ``s390-ccw-virtio`` machine supports the so-called ``loadparm`` parameter
+which can be used to select the kernel on the disk of the guest that the
+s390-ccw bios should boot. When starting QEMU, it can be specified like this::
+
+ qemu-system-s390x -machine s390-ccw-virtio,loadparm=<string>
+
+The first way to use this parameter is to use the word ``PROMPT`` as the
+``<string>`` here. In that case the s390-ccw bios will show a list of
+installed kernels on the disk of the guest and ask the user to enter a number
+to choose which kernel should be booted -- similar to what can be achieved by
+specifying the ``-boot menu=on`` option when starting QEMU. Note that the menu
+list will only show the names of the installed kernels when using a DASD-like
+disk image with 4k byte sectors. On normal SCSI-style disks with 512-byte
+sectors, there is not enough space for the zipl loader on the disk to store
+the kernel names, so you only get a list without names here.
+
+The second way to use this parameter is to use a number in the range from 0
+to 31. The numbers that can be used here correspond to the numbers that are
+shown when using the ``PROMPT`` option, and the s390-ccw bios will then try
+to automatically boot the kernel that is associated with the given number.
+Note that ``0`` can be used to boot the default entry.
+
+
Booting from a network device
-----------------------------
@@ -65,7 +91,7 @@ you can specify it via the ``-global s390-ipl.netboot_fw=filename``
command line option.
The ``bootindex`` property is especially important for booting via the network.
-If you don't specify the the ``bootindex`` property here, the network bootloader
+If you don't specify the ``bootindex`` property here, the network bootloader
firmware code won't get loaded into the guest memory so that the network boot
will fail. For a successful network boot, try something like this::
diff --git a/docs/system/s390x/cpu-topology.rst b/docs/system/s390x/cpu-topology.rst
new file mode 100644
index 0000000000..d5b506ee5c
--- /dev/null
+++ b/docs/system/s390x/cpu-topology.rst
@@ -0,0 +1,246 @@
+.. _cpu-topology-s390x:
+
+CPU topology on s390x
+=====================
+
+Since QEMU 8.2, CPU topology on s390x provides up to 3 levels of
+topology containers: drawers, books and sockets. They define a
+tree-shaped hierarchy.
+
+The socket container has one or more CPU entries.
+Each of these CPU entries consists of a bitmap and three CPU attributes:
+
+- CPU type
+- entitlement
+- dedication
+
+Each bit set in the bitmap corresponds to a core-id of a vCPU with matching
+attributes.
+
+This documentation provides general information on S390 CPU topology,
+how to enable it and explains the new CPU attributes.
+For information on how to modify the S390 CPU topology and how to
+monitor polarization changes, see ``docs/devel/s390-cpu-topology.rst``.
+
+Prerequisites
+-------------
+
+To use the CPU topology, you currently need to choose the KVM accelerator.
+See :ref:`Accelerators` for more details about accelerators and how to select them.
+
+The s390x host needs to use a Linux kernel v6.0 or newer (which provides the so-called
+``KVM_CAP_S390_CPU_TOPOLOGY`` capability that allows QEMU to signal the
+CPU topology facility via the so-called STFLE bit 11 to the VM).
+
+Enabling CPU topology
+---------------------
+
+Currently, CPU topology is enabled by default only in the "host" CPU model.
+
+Enabling CPU topology in another CPU model is done by setting the CPU flag
+``ctop`` to ``on`` as in:
+
+.. code-block:: bash
+
+ -cpu gen16b,ctop=on
+
+Having the topology disabled by default allows migration between
+old and new QEMU without adding new flags.
+
+Default topology usage
+----------------------
+
+The CPU topology can be specified on the QEMU command line
+with the ``-smp`` or the ``-device`` QEMU command arguments.
+
+Note also that since 7.2 threads are no longer supported in the topology
+and the ``-smp`` command line argument accepts only ``threads=1``.
+
+If none of the container attributes (drawers, books, sockets) are
+specified for the ``-smp`` flag, the number of these containers
+is 1.
+
+Thus the following two options will result in the same topology:
+
+.. code-block:: bash
+
+ -smp cpus=5,drawers=1,books=1,sockets=8,cores=4,maxcpus=32
+
+and
+
+.. code-block:: bash
+
+ -smp cpus=5,sockets=8,cores=4,maxcpus=32
+
+When a CPU is defined by the ``-smp`` command argument, its position
+inside the topology is calculated by adding the CPUs to the topology
+based on the core-id starting with core-0 at position 0 of socket-0,
+book-0, drawer-0 and filling all CPUs of socket-0 before filling socket-1
+of book-0 and so on up to the last socket of the last book of the last
+drawer.
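+
+As an illustration of this fill order, the following Python sketch computes
+the default position of a core from its core-id. It is not QEMU code: the
+function name ``default_placement`` is hypothetical, and it assumes that
+sockets are numbered within their book and books within their drawer.
+
+.. code-block:: python
+
+    def default_placement(core_id, cores, sockets, books=1, drawers=1):
+        """Return (drawer, book, socket) for a core placed by core-id only."""
+        maxcpus = cores * sockets * books * drawers
+        if not 0 <= core_id < maxcpus:
+            raise ValueError(f"core-id {core_id} is outside 0..{maxcpus - 1}")
+        # Fill all cores of a socket, then all sockets of a book,
+        # then all books of a drawer.
+        socket = (core_id // cores) % sockets
+        book = (core_id // (cores * sockets)) % books
+        drawer = core_id // (cores * sockets * books)
+        return drawer, book, socket
+
+    # Values taken from "-smp cpus=5,sockets=8,cores=4,maxcpus=32"
+    for core_id in (0, 4, 9, 14):
+        print(core_id, default_placement(core_id, cores=4, sockets=8))
+
+With 4 cores per socket and 8 sockets, core-ids 0 to 3 land on socket 0,
+core-id 4 on socket 1, core-id 9 on socket 2 and core-id 14 on socket 3,
+all in book 0 of drawer 0.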
+
+When a CPU is defined by the ``-device`` command argument, the
+tree topology attributes must all be defined or all not defined.
+
+.. code-block:: bash
+
+ -device gen16b-s390x-cpu,drawer-id=1,book-id=1,socket-id=2,core-id=1
+
+or
+
+.. code-block:: bash
+
+ -device gen16b-s390x-cpu,core-id=1,dedicated=true
+
+If none of the tree attributes (drawer, book, socket) are specified
+for the ``-device`` argument, then, as for all CPUs defined with the ``-smp``
+command argument, the topology tree attributes will be set by simply
+adding the CPUs to the topology based on the core-id.
+
+QEMU will not try to resolve collisions and will report an error if the
+CPU topology defined explicitly or implicitly on a ``-device``
+argument collides with the definition of a CPU implicitly defined
+on the ``-smp`` argument.
+
+When the topology modifier attributes are not defined for the
+``-device`` command argument, they take the following default values:
+
+- dedicated: ``false``
+- entitlement: ``medium``
+
+
+Hot plug
+++++++++
+
+New CPUs can be plugged using the ``device_add`` HMP command as in:
+
+.. code-block:: bash
+
+ (qemu) device_add gen16b-s390x-cpu,core-id=9
+
+The placement of the CPU is derived from the core-id as described above.
+
+The topology can of course also be fully defined:
+
+.. code-block:: bash
+
+ (qemu) device_add gen16b-s390x-cpu,drawer-id=1,book-id=1,socket-id=2,core-id=1
+
+
+Examples
+++++++++
+
+In the following machine we define 8 sockets with 4 cores each.
+
+.. code-block:: bash
+
+ $ qemu-system-s390x -accel kvm -m 2G \
+ -cpu gen16b,ctop=on \
+ -smp cpus=5,sockets=8,cores=4,maxcpus=32 \
+ -device gen16b-s390x-cpu,core-id=14
+
+New CPUs can be plugged using the ``device_add`` HMP command as before:
+
+.. code-block:: bash
+
+ (qemu) device_add gen16b-s390x-cpu,core-id=9
+
+The core-id defines the placement of the core in the topology by
+starting with core 0 in socket 0 up to maxcpus.
+
+In the example above:
+
+* There are 5 CPUs provided to the guest with the ``-smp`` command line.
+ They will take the core-ids 0,1,2,3,4.
+ As we have 4 cores in a socket, we have 4 CPUs provided
+ to the guest in socket 0, with core-ids 0,1,2,3.
+ The last CPU, with core-id 4, will be on socket 1.
+
+* The core with ID 14 provided by the ``-device`` command line will
+ be placed in socket 3, with core-id 14.
+
+* The core with ID 9 provided by the ``device_add`` HMP command will
+ be placed in socket 2, with core-id 9.
+
+
+Polarization, entitlement and dedication
+----------------------------------------
+
+Polarization
+++++++++++++
+
+The polarization affects how the CPUs of a shared host are utilized/distributed
+among guests.
+The guest determines the polarization by using the PTF instruction.
+
+Polarization defines two models of CPU provisioning: horizontal
+and vertical.
+
+The horizontal polarization is the default model on boot and after
+subsystem reset. When horizontal polarization is in effect all vCPUs should
+have about equal resource provisioning.
+
+In the vertical polarization model vCPUs are unequal, but overall more resources
+might be available.
+The guest can make use of the vCPU entitlement information provided by the host
+to optimize kernel thread scheduling.
+
+A subsystem reset puts all vCPUs of the configuration into the
+horizontal polarization.
+
+Entitlement
++++++++++++
+
+The vertical polarization specifies that the guest's vCPU can get
+different real CPU provisioning:
+
+- a vCPU with vertical high entitlement specifies that this
+ vCPU gets 100% of the real CPU provisioning.
+
+- a vCPU with vertical medium entitlement specifies that this
+ vCPU shares the real CPU with other vCPUs.
+
+- a vCPU with vertical low entitlement specifies that this
+ vCPU only gets real CPU provisioning when no other vCPUs need it.
+
+In the case a vCPU with vertical high entitlement does not use
+the real CPU, the unused "slack" can be dispatched to other vCPUs
+with medium or low entitlement.
+
+A vCPU can be "dedicated" in which case the vCPU is fully dedicated to a single
+real CPU.
+
+The dedicated bit is an indication of affinity of a vCPU for a real CPU
+while the entitlement indicates the sharing or exclusivity of use.
+
+Defining the topology on the command line
+-----------------------------------------
+
+The topology can be defined entirely using ``-device`` CPU statements,
+with the exception of CPU 0, which must be defined with the ``-smp``
+argument.
+
+For example, here we set the position of cores 1, 2 and 3 to
+drawer 1, book 1, socket 2 and cores 0, 9 and 14 to drawer 0,
+book 0, socket 0 without defining entitlement or dedication.
+Core 4 will be set on its default position on socket 1
+(since we have 4 cores per socket) and we define it as dedicated and
+with vertical high entitlement.
+
+.. code-block:: bash
+
+ $ qemu-system-s390x -accel kvm -m 2G \
+ -cpu gen16b,ctop=on \
+ -smp cpus=1,sockets=8,cores=4,maxcpus=32 \
+ \
+ -device gen16b-s390x-cpu,drawer-id=1,book-id=1,socket-id=2,core-id=1 \
+ -device gen16b-s390x-cpu,drawer-id=1,book-id=1,socket-id=2,core-id=2 \
+ -device gen16b-s390x-cpu,drawer-id=1,book-id=1,socket-id=2,core-id=3 \
+ \
+ -device gen16b-s390x-cpu,drawer-id=0,book-id=0,socket-id=0,core-id=9 \
+ -device gen16b-s390x-cpu,drawer-id=0,book-id=0,socket-id=0,core-id=14 \
+ \
+ -device gen16b-s390x-cpu,core-id=4,dedicated=on,entitlement=high
+
+The entitlement defined for CPU 4 will only be used after the guest
+successfully enables vertical polarization by using the PTF instruction.
diff --git a/docs/system/s390x/pcidevices.rst b/docs/system/s390x/pcidevices.rst
new file mode 100644
index 0000000000..628effa2f4
--- /dev/null
+++ b/docs/system/s390x/pcidevices.rst
@@ -0,0 +1,41 @@
+PCI devices on s390x
+====================
+
+PCI devices on s390x work differently than on other architectures and need to
+be configured in a slightly different way.
+
+Every PCI device is linked with an additional ``zpci`` device.
+While the ``zpci`` device will be autogenerated if not specified, it is
+recommended to specify it explicitly so that you can pass s390-specific
+PCI configuration.
+
+For example, in order to pass a PCI device ``0000:00:00.0`` through to the
+guest, you would specify::
+
+ qemu-system-s390x ... \
+ -device zpci,uid=1,fid=0,target=hostdev0,id=zpci1 \
+ -device vfio-pci,host=0000:00:00.0,id=hostdev0
+
+Here, the zpci device is joined with the PCI device via the ``target`` property.
+
+Note that we don't set bus, slot or function here for the guest as is common in
+other PCI implementations. Topology information is not available on s390x, and
+the guest will not see any of the bus, slot or function information specified
+on the command line.
+
+Instead, ``uid`` and ``fid`` determine how the device is presented to the guest
+operating system.
+
+In case of Linux, ``uid`` will be used in the ``domain`` part of the PCI
+identifier, and ``fid`` identifies the physical slot, i.e.::
+
+ qemu-system-s390x ... \
+ -device zpci,uid=7,fid=8,target=hostdev0,id=zpci1 \
+ ...
+
+will be presented in the guest as::
+
+ # lspci -v
+ 0007:00:00.0 ...
+ Physical Slot: 00000008
+ ...
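+
+The mapping in this example can be sketched in Python. This is only an
+illustration inferred from the output above, not code taken from QEMU or
+Linux: the helper name ``guest_view`` is hypothetical, and rendering ``uid``
+as four and ``fid`` as eight hexadecimal digits is an assumption.
+
+.. code-block:: python
+
+    def guest_view(uid, fid):
+        """Return (PCI address, physical slot) as in the lspci example."""
+        # Bus, slot and function are not taken from the QEMU command line,
+        # so they are shown as zero, as in the lspci output above.
+        address = f"{uid:04x}:00:00.0"
+        # Assumption: the physical slot is the fid as 8 hex digits.
+        physical_slot = f"{fid:08x}"
+        return address, physical_slot
+
+    print(guest_view(uid=7, fid=8))
+
+For ``uid=7,fid=8`` the sketch yields the address ``0007:00:00.0`` and the
+physical slot ``00000008``, matching the example output.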
diff --git a/docs/system/target-arm.rst b/docs/system/target-arm.rst
index 91ebc26c6d..c9d7c0dda7 100644
--- a/docs/system/target-arm.rst
+++ b/docs/system/target-arm.rst
@@ -83,6 +83,8 @@ undocumented; you can get a complete list by running
arm/versatile
arm/vexpress
arm/aspeed
+ arm/bananapi_m2u.rst
+ arm/b-l475e-iot01a.rst
arm/sabrelite
arm/digic
arm/cubieboard
@@ -106,6 +108,7 @@ undocumented; you can get a complete list by running
arm/stm32
arm/virt
arm/xlnx-versal-virt
+ arm/xenpvh
Emulated CPU architecture support
=================================
diff --git a/docs/system/target-i386-desc.rst.inc b/docs/system/target-i386-desc.rst.inc
index 7d1fffacbe..319e540573 100644
--- a/docs/system/target-i386-desc.rst.inc
+++ b/docs/system/target-i386-desc.rst.inc
@@ -36,7 +36,7 @@ The QEMU PC System emulator simulates the following peripherals:
- PCI UHCI, OHCI, EHCI or XHCI USB controller and a virtual USB-1.1
hub.
-SMP is supported with up to 255 CPUs.
+SMP is supported with up to 255 CPUs (and 4096 CPUs for the PC Q35 machine).
QEMU uses the PC BIOS from the Seabios project and the Plex86/Bochs LGPL
VGA BIOS.
@@ -71,3 +71,11 @@ machine property, i.e.
|qemu_system_x86| some.img \
-audiodev <backend>,id=<name> \
-machine pcspk-audiodev=<name>
+
+Machine-specific options
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+It supports the following machine-specific options:
+
+- ``x-south-bridge=PIIX3|piix4-isa`` (Experimental option to select a particular
+ south bridge. Default: ``PIIX3``)
diff --git a/docs/system/target-i386.rst b/docs/system/target-i386.rst
index c9720a8cd1..1b8a1f248a 100644
--- a/docs/system/target-i386.rst
+++ b/docs/system/target-i386.rst
@@ -3,8 +3,6 @@
x86 System emulator
-------------------
-.. _pcsys_005fdevices:
-
Board-specific documentation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -26,8 +24,11 @@ Architectural features
:maxdepth: 1
i386/cpu
-
-.. _pcsys_005freq:
+ i386/hyperv
+ i386/xen
+ i386/kvm-pv
+ i386/sgx
+ i386/amd-memory-encryption
OS requirements
~~~~~~~~~~~~~~~
diff --git a/docs/system/target-mips.rst b/docs/system/target-mips.rst
index 138441bdec..83239fb9df 100644
--- a/docs/system/target-mips.rst
+++ b/docs/system/target-mips.rst
@@ -8,8 +8,6 @@ endian options, ``qemu-system-mips``, ``qemu-system-mipsel``
``qemu-system-mips64`` and ``qemu-system-mips64el``. Five different
machine types are emulated:
-- A generic ISA PC-like machine \"mips\"
-
- The MIPS Malta prototype board \"malta\"
- An ACER Pica \"pica61\". This machine needs the 64-bit emulator.
@@ -19,18 +17,6 @@ machine types are emulated:
- A MIPS Magnum R4000 machine \"magnum\". This machine needs the
64-bit emulator.
-The generic emulation is supported by Debian 'Etch' and is able to
-install Debian into a virtual disk image. The following devices are
-emulated:
-
-- A range of MIPS CPUs, default is the 24Kf
-
-- PC style serial port
-
-- PC style IDE disk
-
-- NE2000 network card
-
The Malta emulation supports the following devices:
- Core board with MIPS 24Kf CPU and Galileo system controller
diff --git a/docs/system/target-openrisc.rst b/docs/system/target-openrisc.rst
new file mode 100644
index 0000000000..22cb2217a6
--- /dev/null
+++ b/docs/system/target-openrisc.rst
@@ -0,0 +1,71 @@
+.. _OpenRISC-System-emulator:
+
+OpenRISC System emulator
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+QEMU can emulate 32-bit OpenRISC CPUs using the ``qemu-system-or1k`` executable.
+
+OpenRISC CPUs are generally built into "system-on-chip" (SoC) designs that run
+on FPGAs. These SoCs are based on the same core architecture as the or1ksim
+(the original OpenRISC instruction level simulator) which QEMU supports. For
+this reason QEMU does not need to support many different boards to support the
+OpenRISC hardware ecosystem.
+
+The OpenRISC CPU supported by QEMU is the ``or1200``; it supports an MMU and can
+run Linux.
+
+Choosing a board model
+======================
+
+For QEMU's OpenRISC system emulation, you must specify which board model you
+want to use with the ``-M`` or ``--machine`` option; the default machine is
+``or1k-sim``.
+
+If you intend to boot Linux, it is possible to have a single kernel image that
+will boot on any of the QEMU machines. To do this one would compile all required
+drivers into the kernel. This is possible because QEMU will create a device tree
+structure that describes the QEMU machine and pass a pointer to the structure to
+the kernel. The kernel can then use this to configure itself for the machine.
+
+However, typically users will have specific firmware images for a specific machine.
+
+If you already have a system image or a kernel that works on hardware and you
+want to boot with QEMU, check whether QEMU lists that machine in its ``-machine
+help`` output. If it is listed, then you can probably use that board model. If
+it is not listed, then unfortunately your image will almost certainly not boot
+on QEMU. (You might be able to extract the filesystem and use that with a
+different kernel which boots on a system that QEMU does emulate.)
+
+If you don't care about reproducing the idiosyncrasies of a particular
+bit of hardware, such as a small amount of RAM, no PCI or other hard disk, etc.,
+and just want to run Linux, the best option is to use the ``virt`` board. This
+is a platform which doesn't correspond to any real hardware and is designed for
+use in virtual machines. You'll need to compile Linux with a suitable
+configuration for running on the ``virt`` board. ``virt`` supports PCI, virtio
+and large amounts of RAM.
+
+Board-specific documentation
+============================
+
+..
+ This table of contents should be kept sorted alphabetically
+ by the title text of each file, which isn't the same ordering
+ as an alphabetical sort by filename.
+
+.. toctree::
+ :maxdepth: 1
+
+ openrisc/or1k-sim
+ openrisc/virt
+
+Emulated CPU architecture support
+=================================
+
+.. toctree::
+ openrisc/emulation
+
+OpenRISC CPU features
+=====================
+
+.. toctree::
+ openrisc/cpu-features
diff --git a/docs/system/target-ppc.rst b/docs/system/target-ppc.rst
index 4f6eb93b17..87bf412ce5 100644
--- a/docs/system/target-ppc.rst
+++ b/docs/system/target-ppc.rst
@@ -17,6 +17,7 @@ help``.
.. toctree::
:maxdepth: 1
+ ppc/amigang
ppc/embedded
ppc/powermac
ppc/powernv
diff --git a/docs/system/target-riscv.rst b/docs/system/target-riscv.rst
index 89a866e4f4..ba195f1518 100644
--- a/docs/system/target-riscv.rst
+++ b/docs/system/target-riscv.rst
@@ -76,11 +76,19 @@ RISC-V CPU firmware
When using the ``sifive_u`` or ``virt`` machine there are three different
firmware boot options:
-1. ``-bios default`` - This is the default behaviour if no -bios option
-is included. This option will load the default OpenSBI firmware automatically.
-The firmware is included with the QEMU release and no user interaction is
-required. All a user needs to do is specify the kernel they want to boot
-with the -kernel option
-2. ``-bios none`` - QEMU will not automatically load any firmware. It is up
-to the user to load all the images they need.
-3. ``-bios <file>`` - Tells QEMU to load the specified file as the firmware.
+
+* ``-bios default``
+
+This is the default behaviour if no ``-bios`` option is included. This option
+will load the default OpenSBI firmware automatically. The firmware is included
+with the QEMU release and no user interaction is required. All a user needs to
+do is specify the kernel they want to boot with the ``-kernel`` option.
+
+* ``-bios none``
+
+QEMU will not automatically load any firmware. It is up to the user to load all
+the images they need.
+
+* ``-bios <file>``
+
+Tells QEMU to load the specified file as the firmware.
diff --git a/docs/system/target-s390x.rst b/docs/system/target-s390x.rst
index c636f64113..94c981e732 100644
--- a/docs/system/target-s390x.rst
+++ b/docs/system/target-s390x.rst
@@ -26,6 +26,7 @@ or vfio-ap is also available.
s390x/css
s390x/3270
s390x/vfio-ccw
+ s390x/pcidevices
Architectural features
======================
@@ -33,3 +34,4 @@ Architectural features
.. toctree::
s390x/bootdevices
s390x/protvirt
+ s390x/cpu-topology
diff --git a/docs/system/target-sparc.rst b/docs/system/target-sparc.rst
index b55f8d09e9..9ec8c90c14 100644
--- a/docs/system/target-sparc.rst
+++ b/docs/system/target-sparc.rst
@@ -38,7 +38,7 @@ QEMU emulates the following sun4m peripherals:
- Non Volatile RAM M48T02/M48T08
- Slave I/O: timers, interrupt controllers, Zilog serial ports,
- keyboard and power/reset logic
+ :ref:`keyboard` and power/reset logic
- ESP SCSI controller with hard disk and CD-ROM support
diff --git a/docs/system/targets.rst b/docs/system/targets.rst
index 9dcd95dd84..224fadae71 100644
--- a/docs/system/targets.rst
+++ b/docs/system/targets.rst
@@ -21,6 +21,7 @@ Contents:
target-m68k
target-mips
target-ppc
+ target-openrisc
target-riscv
target-rx
target-s390x
diff --git a/docs/system/tls.rst b/docs/system/tls.rst
index b0973afe1b..e284c82801 100644
--- a/docs/system/tls.rst
+++ b/docs/system/tls.rst
@@ -182,7 +182,7 @@ certificates.
--template client-hostNNN.info \
--outfile client-hostNNN-cert.pem
-The subject alt name extension data is not required for clients, so the
+The subject alt name extension data is not required for clients, so
the ``dns_name`` and ``ip_address`` fields are not included. The
``tls_www_client`` keyword is the key purpose extension to indicate this
certificate is intended for usage in a web client. Although QEMU network
@@ -311,7 +311,7 @@ containing one or more usernames and random keys::
mkdir -m 0700 /tmp/keys
psktool -u rich -p /tmp/keys/keys.psk
-TLS-enabled servers such as qemu-nbd can use this directory like so::
+TLS-enabled servers such as ``qemu-nbd`` can use this directory like so::
qemu-nbd \
-t -x / \
diff --git a/docs/system/vm-templating.rst b/docs/system/vm-templating.rst
new file mode 100644
index 0000000000..28905a1eeb
--- /dev/null
+++ b/docs/system/vm-templating.rst
@@ -0,0 +1,125 @@
+QEMU VM templating
+==================
+
+This document explains how to use VM templating in QEMU.
+
+For now, the focus is on VM memory aspects, and not on how to save and
+restore other VM state (i.e., migrate-to-file with ``x-ignore-shared``).
+
+Overview
+--------
+
+With VM templating, a single template VM serves as the starting point for
+new VMs. This allows for fast and efficient replication of VMs, resulting
+in fast startup times and reduced memory consumption.
+
+Conceptually, the VM state is frozen, to then be used as a basis for new
+VMs. The Copy-On-Write mechanism in the operating systems makes sure that
+new VMs are able to read template VM memory; however, any modifications
+stay private and don't modify the original template VM or any other
+created VM.
+
+!!! Security Alert !!!
+----------------------
+
+When effectively cloning VMs by VM templating, hardware identifiers
+(such as UUIDs and NIC MAC addresses), and similar data in the guest OS
+(such as machine IDs, SSH keys, certificates) that are supposed to be
+*unique* are no longer unique, which can be a security concern.
+
+Please be aware of these implications and how to mitigate them for your
+use case, which might involve vmgenid, hot(un)plug of NIC, etc.
+
+Memory configuration
+--------------------
+
+In order to create the template VM, we have to make sure that VM memory
+ends up in a file, from where it can be reused for the new VMs:
+
+Supply VM RAM via memory-backend-file, with ``share=on`` (modifications go
+to the file) and ``readonly=off`` (open the file writable). Note that
+``readonly=off`` is implicit.
+
+In the following command-line example, a 2GB VM is created, whereby VM RAM
+is to be stored in the ``template`` file.
+
+.. parsed-literal::
+
+ |qemu_system| [...] -m 2g \\
+ -object memory-backend-file,id=pc.ram,mem-path=template,size=2g,share=on,... \\
+ -machine q35,memory-backend=pc.ram
+
+If multiple memory backends are used (vNUMA, DIMMs), configure all
+memory backends accordingly.
+
+Once the VM is in the desired state, stop the VM and save other VM state,
+leaving the current state of VM RAM in the file.
+
+In order to have a new VM be based on a template VM, we have to
+configure VM RAM to be based on a template VM RAM file; however, the VM
+should not be able to modify file content.
+
+Supply VM RAM via memory-backend-file, with ``share=off`` (modifications
+stay private), ``readonly=on`` (open the file readonly) and ``rom=off``
+(don't make the memory readonly for the VM). Note that ``share=off`` is
+implicit and that other VM state has to be restored separately.
+
+In the following command-line example, a 2GB VM is created based on the
+existing 2GB file ``template``.
+
+.. parsed-literal::
+
+ |qemu_system| [...] -m 2g \\
+ -object memory-backend-file,id=pc.ram,mem-path=template,size=2g,readonly=on,rom=off,... \\
+ -machine q35,memory-backend=pc.ram
+
+If multiple memory backends are used (vNUMA, DIMMs), configure all
+memory backends accordingly.
+
+Note that ``-mem-path`` cannot be used for VM templating when creating the
+template VM or when starting new VMs based on a template VM.
+
+Incompatible features
+---------------------
+
+Some features are incompatible with VM templating, as the underlying file
+cannot be modified to discard VM RAM, or to actually share memory with
+another process.
+
+vhost-user and multi-process QEMU
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+vhost-user and multi-process QEMU are incompatible with VM templating.
+These technologies rely on shared memory; however, the template VMs
+don't actually share memory (``share=off``), even though they are
+file-based.
+
+virtio-balloon
+~~~~~~~~~~~~~~
+
+virtio-balloon inflation and "free page reporting" cannot discard VM RAM
+and will repeatedly report errors. While virtio-balloon can be used
+for template VMs (e.g., report VM RAM stats), "free page reporting"
+should be disabled and the balloon should not be inflated.
+
+virtio-mem
+~~~~~~~~~~
+
+virtio-mem cannot discard VM RAM that is managed by the virtio-mem
+device. virtio-mem will fail early when realizing the device. To use
+VM templating with virtio-mem, either hotplug virtio-mem devices to the
+new VM, or don't supply any memory to the template VM using virtio-mem
+(requested-size=0) and don't use a template VM file as the memory backend
+for the virtio-mem device.
+
+VM migration
+~~~~~~~~~~~~
+
+For VM migration, "x-release-ram" similarly relies on discarding of VM
+RAM on the migration source to free up migrated RAM, and will
+repeatedly report errors.
+
+Postcopy live migration fails early when discarding VM RAM on the migration
+destination and refuses to activate postcopy live migration. Note
+that postcopy live migration usually only works on selected filesystems
+(shmem/tmpfs, hugetlbfs) either way.
diff --git a/docs/throttle.txt b/docs/throttle.txt
index b5b78b7326..0a0453a5ee 100644
--- a/docs/throttle.txt
+++ b/docs/throttle.txt
@@ -273,11 +273,9 @@ A group can be created using the object-add QMP function:
"arguments": {
"qom-type": "throttle-group",
"id": "group0",
- "props": {
- "limits" : {
- "iops-total": 1000
- "bps-write": 2097152
- }
+ "limits" : {
+ "iops-total": 1000,
+ "bps-write": 2097152
}
}
}
diff --git a/docs/tools/index.rst b/docs/tools/index.rst
index 1edd5a8054..8e65ce0dfc 100644
--- a/docs/tools/index.rst
+++ b/docs/tools/index.rst
@@ -1,3 +1,5 @@
+.. _Tools:
+
-----
Tools
-----
@@ -14,4 +16,3 @@ command line utilities and other standalone programs.
qemu-pr-helper
qemu-trace-stap
virtfs-proxy-helper
- virtiofsd
diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index d58980aef8..3653adb963 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -57,7 +57,7 @@ cases. See below for a description of the supported disk formats.
*OUTPUT_FMT* is the destination format.
*OPTIONS* is a comma separated list of format specific options in a
-name=value format. Use ``-o ?`` for an overview of the options supported
+name=value format. Use ``-o help`` for an overview of the options supported
by the used format or see the format descriptions below for details.
*SNAPSHOT_PARAM* is param used for internal snapshot, format is
@@ -106,7 +106,11 @@ by the used format or see the format descriptions below for details.
.. option:: -c
- Indicates that target image must be compressed (qcow format only).
+ Indicates that target image must be compressed (qcow/qcow2 and vmdk with
+ streamOptimized subformat only).
+
+ For qcow2, the compression algorithm can be specified with the ``-o
+ compression_type=...`` option (see below).
.. option:: -h
@@ -127,9 +131,9 @@ by the used format or see the format descriptions below for details.
.. option:: -S SIZE
Indicates the consecutive number of bytes that must contain only zeros
- for qemu-img to create a sparse image during conversion. This value is rounded
- down to the nearest 512 bytes. You may use the common size suffixes like
- ``k`` for kilobytes.
+ for ``qemu-img`` to create a sparse image during conversion. This value is
+ rounded down to the nearest 512 bytes. You may use the common size suffixes
+ like ``k`` for kilobytes.
.. option:: -t CACHE
@@ -332,8 +336,8 @@ Command description:
``-r all`` fixes all kinds of errors, with a higher risk of choosing the
wrong fix or hiding corruption that has already occurred.
- Only the formats ``qcow2``, ``qed`` and ``vdi`` support
- consistency checks.
+ Only the formats ``qcow2``, ``qed``, ``parallels``, ``vhdx``, ``vmdk`` and
+ ``vdi`` support consistency checks.
In case the image does not have any inconsistencies, check exits with ``0``.
Other exit codes indicate the kind of inconsistency found or if another error
@@ -402,7 +406,7 @@ Command description:
Compare exits with ``0`` in case the images are equal and with ``1``
in case the images differ. Other exit codes mean an error occurred during
execution and standard error output should contain an error message.
- The following table sumarizes all exit codes of the compare subcommand:
+ The following table summarizes all exit codes of the compare subcommand:
0
Images are identical (or requested help was printed)
@@ -415,7 +419,7 @@ Command description:
4
Error on reading data
-.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [--bitmaps [--skip-broken-bitmaps]] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE [-F backing_fmt]] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
+.. option:: convert [--object OBJECTDEF] [--image-opts] [--target-image-opts] [--target-is-zero] [--bitmaps [--skip-broken-bitmaps]] [-U] [-C] [-c] [-p] [-q] [-n] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-O OUTPUT_FMT] [-B BACKING_FILE [-F BACKING_FMT]] [-o OPTIONS] [-l SNAPSHOT_PARAM] [-S SPARSE_SIZE] [-r RATE_LIMIT] [-m NUM_COROUTINES] [-W] FILENAME [FILENAME2 [...]] OUTPUT_FILENAME
Convert the disk image *FILENAME* or a snapshot *SNAPSHOT_PARAM*
to disk image *OUTPUT_FILENAME* using format *OUTPUT_FMT*. It can
@@ -431,7 +435,7 @@ Command description:
suppressed from the destination image.
*SPARSE_SIZE* indicates the consecutive number of bytes (defaults to 4k)
- that must contain only zeros for qemu-img to create a sparse image during
+ that must contain only zeros for ``qemu-img`` to create a sparse image during
conversion. If *SPARSE_SIZE* is 0, the source will not be scanned for
unallocated or zero sectors, and the destination image will always be
fully allocated.
@@ -447,7 +451,7 @@ Command description:
If the ``-n`` option is specified, the target volume creation will be
skipped. This is useful for formats such as ``rbd`` if the target
volume has already been created with site specific options that cannot
- be supplied through qemu-img.
+ be supplied through ``qemu-img``.
Out of order writes can be enabled with ``-W`` to improve performance.
This is only recommended for preallocated devices like host devices or other
@@ -463,7 +467,7 @@ Command description:
``--skip-broken-bitmaps`` is also specified to copy only the
consistent bitmaps.
-.. option:: create [--object OBJECTDEF] [-q] [-f FMT] [-b BACKING_FILE] [-F BACKING_FMT] [-u] [-o OPTIONS] FILENAME [SIZE]
+.. option:: create [--object OBJECTDEF] [-q] [-f FMT] [-b BACKING_FILE [-F BACKING_FMT]] [-u] [-o OPTIONS] FILENAME [SIZE]
Create the new disk image *FILENAME* of size *SIZE* and format
*FMT*. Depending on the file format, you can add one or more *OPTIONS*
@@ -472,7 +476,7 @@ Command description:
If the option *BACKING_FILE* is specified, then the image will record
only the differences from *BACKING_FILE*. No size needs to be specified in
this case. *BACKING_FILE* will never be modified unless you use the
- ``commit`` monitor command (or qemu-img commit).
+ ``commit`` monitor command (or ``qemu-img commit``).
If a relative path name is given, the backing file is looked up relative to
the directory containing *FILENAME*.
@@ -663,7 +667,7 @@ Command description:
List, apply, create or delete snapshots in image *FILENAME*.
-.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-p] [-u] -b BACKING_FILE [-F BACKING_FMT] FILENAME
+.. option:: rebase [--object OBJECTDEF] [--image-opts] [-U] [-q] [-f FMT] [-t CACHE] [-T SRC_CACHE] [-p] [-u] [-c] -b BACKING_FILE [-F BACKING_FMT] FILENAME
Changes the backing file of an image. Only the formats ``qcow2`` and
``qed`` support changing the backing file.
@@ -684,20 +688,22 @@ Command description:
Safe mode
This is the default mode and performs a real rebase operation. The
- new backing file may differ from the old one and qemu-img rebase
+ new backing file may differ from the old one and ``qemu-img rebase``
will take care of keeping the guest-visible content of *FILENAME*
unchanged.
In order to achieve this, any clusters that differ between
*BACKING_FILE* and the old backing file of *FILENAME* are merged
- into *FILENAME* before actually changing the backing file.
+ into *FILENAME* before actually changing the backing file. With the
+ ``-c`` option specified, the clusters which are being merged (but not
+ the entire *FILENAME* image) are compressed when written.
Note that the safe mode is an expensive operation, comparable to
converting an image. It only works if the old backing file still
exists.
Unsafe mode
- qemu-img uses the unsafe mode if ``-u`` is specified. In this
+ ``qemu-img`` uses the unsafe mode if ``-u`` is specified. In this
mode, only the backing file name and format of *FILENAME* is changed
without any checks on the file contents. The user must take care of
specifying the correct new backing file, or the guest-visible
@@ -735,7 +741,7 @@ Command description:
sizes accordingly. Failure to do so will result in data loss!
When shrinking images, the ``--shrink`` option must be given. This informs
- qemu-img that the user acknowledges all loss of data beyond the truncated
+ ``qemu-img`` that the user acknowledges all loss of data beyond the truncated
image's end.
After using this command to grow a disk image, you must use file system and
@@ -776,7 +782,7 @@ Supported image file formats:
QEMU image format, the most versatile format. Use it to have smaller
images (useful if your filesystem does not support holes, for example
- on Windows), optional AES encryption, zlib based compression and
+ on Windows), optional AES encryption, zlib or zstd based compression and
support of multiple VM snapshots.
Supported options:
@@ -794,6 +800,17 @@ Supported image file formats:
``backing_fmt``
Image format of the base image
+ ``compression_type``
+ This option configures which compression algorithm will be used for
+ compressed clusters on the image. Note that setting this option doesn't yet
+ cause the image to actually receive compressed writes. It is most commonly
+ used with the ``-c`` option of ``qemu-img convert``, but can also be used
+ with the ``compress`` filter driver or backup block jobs with compression
+ enabled.
+
+ Valid values are ``zlib`` and ``zstd``. For images that use
+ ``compat=0.10``, only ``zlib`` compression is available.
+
``encryption``
If this option is set to ``on``, the image is encrypted with
128-bit AES-CBC.
diff --git a/docs/tools/qemu-nbd.rst b/docs/tools/qemu-nbd.rst
index e39a9f4b1a..329f44d989 100644
--- a/docs/tools/qemu-nbd.rst
+++ b/docs/tools/qemu-nbd.rst
@@ -27,18 +27,18 @@ Options
.. program:: qemu-nbd
*filename* is a disk image filename, or a set of block
-driver options if ``--image-opts`` is specified.
+driver options if :option:`--image-opts` is specified.
*dev* is an NBD device.
-.. option:: --object type,id=ID,...props...
+.. option:: --object type,id=ID,...
Define a new instance of the *type* object class identified by *ID*.
See the :manpage:`qemu(1)` manual page for full details of the properties
supported. The common object types that it makes sense to define are the
``secret`` object, which is used to supply passwords and/or encryption
keys, and the ``tls-creds`` object, which is used to supply TLS
- credentials for the qemu-nbd server or client.
+ credentials for the ``qemu-nbd`` server or client.
.. option:: -p, --port=PORT
@@ -99,8 +99,10 @@ driver options if ``--image-opts`` is specified.
.. option:: --cache=CACHE
- The cache mode to be used with the file. See the documentation of
- the emulator's ``-drive cache=...`` option for allowed values.
+ The cache mode to be used with the file. Valid values are:
+ ``none``, ``writeback`` (the default), ``writethrough``,
+ ``directsync`` and ``unsafe``. See the documentation of
+ the emulator's ``-drive cache=...`` option for more info.
.. option:: -n, --nocache
@@ -137,8 +139,7 @@ driver options if ``--image-opts`` is specified.
.. option:: -e, --shared=NUM
Allow up to *NUM* clients to share the device (default
- ``1``), 0 for unlimited. Safe for readers, but for now,
- consistency is not guaranteed between multiple writers.
+ ``1``), 0 for unlimited.
.. option:: -t, --persistent
@@ -163,9 +164,22 @@ driver options if ``--image-opts`` is specified.
.. option:: --tls-creds=ID
Enable mandatory TLS encryption for the server by setting the ID
- of the TLS credentials object previously created with the --object
- option; or provide the credentials needed for connecting as a client
- in list mode.
+ of the TLS credentials object previously created with the
+ :option:`--object` option; or provide the credentials needed for
+ connecting as a client in list mode.
+
+.. option:: --tls-hostname=hostname
+
+ When validating an x509 certificate received over a TLS connection,
+ the hostname that the NBD client used to connect will be checked
+ against information in the server provided certificate. Sometimes
+ it might be required to override the hostname used to perform this
+ check. For example, if the NBD client is using a tunnel from localhost
+ to connect to the remote server, the :option:`--tls-hostname` option should
+ be used to set the officially expected hostname of the remote NBD
+ server. This can also be used if accessing NBD over a UNIX socket
+ where there is no inherent hostname available. This is only permitted
+ when acting as an NBD client with the :option:`--list` option.
.. option:: --fork
@@ -183,7 +197,9 @@ driver options if ``--image-opts`` is specified.
.. option:: -v, --verbose
- Display extra debugging information.
+ Display extra debugging information. This option also keeps the original
+ *STDERR* stream open if the ``qemu-nbd`` process is daemonized due to
+ other options like :option:`--fork` or :option:`-c`.
.. option:: -h, --help
@@ -211,7 +227,7 @@ disconnects:
qemu-nbd -f qcow2 file.qcow2
Start a long-running server listening with encryption on port 10810,
-and whitelist clients with a specific X.509 certificate to connect to
+and allow clients with a specific X.509 certificate to connect to
a 1 megabyte subset of a raw file, using the export name 'subset':
::
@@ -236,7 +252,7 @@ daemon:
Expose the guest-visible contents of a qcow2 file via a block device
/dev/nbd0 (and possibly creating /dev/nbd0p1 and friends for
partitions found within), then disconnect the device when done.
-Access to bind qemu-nbd to an /dev/nbd device generally requires root
+Access to bind ``qemu-nbd`` to a /dev/nbd device generally requires root
privileges, and may also require the execution of ``modprobe nbd``
to enable the kernel NBD client module. *CAUTION*: Do not use
this method to mount filesystems from an untrusted guest image - a
diff --git a/docs/tools/qemu-pr-helper.rst b/docs/tools/qemu-pr-helper.rst
index eaebe40da0..c32867cfc6 100644
--- a/docs/tools/qemu-pr-helper.rst
+++ b/docs/tools/qemu-pr-helper.rst
@@ -21,8 +21,8 @@ programs because incorrect usage can disrupt regular operation of the
storage fabric. QEMU's SCSI passthrough devices ``scsi-block``
and ``scsi-generic`` support passing guest persistent reservation
requests to a privileged external helper program. :program:`qemu-pr-helper`
-is that external helper; it creates a socket which QEMU can
-connect to to communicate with it.
+is that external helper; it creates a listener socket which will
+accept incoming connections for communication with QEMU.
If you want to run VMs in a setup like this, this helper should be
started as a system service, and you should read the QEMU manual
diff --git a/docs/tools/qemu-storage-daemon.rst b/docs/tools/qemu-storage-daemon.rst
index b8ef4486f1..ea00149a63 100644
--- a/docs/tools/qemu-storage-daemon.rst
+++ b/docs/tools/qemu-storage-daemon.rst
@@ -10,9 +10,10 @@ Synopsis
Description
-----------
-qemu-storage-daemon provides disk image functionality from QEMU, qemu-img, and
-qemu-nbd in a long-running process controlled via QMP commands without running
-a virtual machine. It can export disk images, run block job operations, and
+``qemu-storage-daemon`` provides disk image functionality from QEMU,
+``qemu-img``, and ``qemu-nbd`` in a long-running process controlled via QMP
+commands without running a virtual machine.
+It can export disk images, run block job operations, and
perform other disk-related operations. The daemon is controlled via a QMP
monitor and initial configuration from the command-line.
@@ -75,7 +76,8 @@ Standard options:
.. option:: --export [type=]nbd,id=<id>,node-name=<node-name>[,name=<export-name>][,writable=on|off][,bitmap=<name>]
--export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=unix,addr.path=<socket-path>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
--export [type=]vhost-user-blk,id=<id>,node-name=<node-name>,addr.type=fd,addr.str=<fd>[,writable=on|off][,logical-block-size=<block-size>][,num-queues=<num-queues>]
- --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off]
+ --export [type=]fuse,id=<id>,node-name=<node-name>,mountpoint=<file>[,growable=on|off][,writable=on|off][,allow-other=on|off|auto]
+ --export [type=]vduse-blk,id=<id>,node-name=<node-name>,name=<vduse-name>[,writable=on|off][,num-queues=<num-queues>][,queue-size=<queue-size>][,logical-block-size=<block-size>][,serial=<serial-number>]
is a block export definition. ``node-name`` is the block node that should be
exported. ``writable`` determines whether or not the export allows write
@@ -102,7 +104,33 @@ Standard options:
mounted). Consequently, applications that have opened the given file before
the export became active will continue to see its original content. If
``growable`` is set, writes after the end of the exported file will grow the
- block node to fit.
+ block node to fit. The ``allow-other`` option controls whether users other
+ than the user running the process will be allowed to access the export. Note
+ that enabling this option as a non-root user requires enabling the
+ user_allow_other option in the global fuse.conf configuration file. Setting
+ ``allow-other`` to ``auto`` (the default) will try enabling this option, and on
+ error fall back to disabling it.
+
+ The ``vduse-blk`` export type takes a ``name`` (must be unique across the host)
+ to create the VDUSE device.
+ ``num-queues`` sets the number of virtqueues (the default is 1).
+ ``queue-size`` sets the virtqueue descriptor table size (the default is 256).
+
+ The instantiated VDUSE device must then be added to the vDPA bus using the
+ vdpa(8) command from the iproute2 project::
+
+ # vdpa dev add name <id> mgmtdev vduse
+
+ The device can be removed from the vDPA bus later as follows::
+
+ # vdpa dev del <id>
+
+ For more information about attaching vDPA devices to the host with
+ virtio_vdpa.ko or attaching them to guests with vhost_vdpa.ko, see
+ https://vdpa-dev.gitlab.io/.
+
+ For more information about VDUSE, see
+ https://docs.kernel.org/userspace-api/vduse.html.
.. option:: --monitor MONITORDEF
@@ -148,6 +176,13 @@ Standard options:
created but before accepting connections. The daemon has started successfully
when the pid file is written and clients may begin connecting.
+.. option:: --daemonize
+
+ Daemonize the process. The parent process will exit once startup is complete
+ (i.e., after the pid file has been or would have been written) or failure
+ occurs. Its exit code reflects whether the child has started up successfully
+ or failed to do so.
+
Examples
--------
Launch the daemon with QMP monitor socket ``qmp.sock`` so clients can execute
@@ -200,7 +235,7 @@ Export raw image file ``disk.img`` over NBD UNIX domain socket ``nbd.sock``::
--nbd-server addr.type=unix,addr.path=nbd.sock \
--export type=nbd,id=export,node-name=disk,writable=on
-Export a qcow2 image file ``disk.qcow2`` as a vhosts-user-blk device over UNIX
+Export a qcow2 image file ``disk.qcow2`` as a vhost-user-blk device over UNIX
domain socket ``vhost-user-blk.sock``::
$ qemu-storage-daemon \
diff --git a/docs/tools/qemu-trace-stap.rst b/docs/tools/qemu-trace-stap.rst
index d53073b52b..2169ce5d17 100644
--- a/docs/tools/qemu-trace-stap.rst
+++ b/docs/tools/qemu-trace-stap.rst
@@ -46,19 +46,19 @@ The following commands are valid:
any of the listed names. If no *PATTERN* is given, then all possible
probes will be listed.
- For example, to list all probes available in the ``qemu-system-x86_64``
+ For example, to list all probes available in the |qemu_system|
binary:
- ::
+ .. parsed-literal::
- $ qemu-trace-stap list qemu-system-x86_64
+ $ qemu-trace-stap list |qemu_system|
To filter the list to only cover probes related to QEMU's cryptographic
subsystem, in a binary outside ``$PATH``
- ::
+ .. parsed-literal::
- $ qemu-trace-stap list /opt/qemu/4.0.0/bin/qemu-system-x86_64 'qcrypto*'
+ $ qemu-trace-stap list /opt/qemu/|version|/bin/|qemu_system| 'qcrypto*'
.. option:: run OPTIONS BINARY PATTERN...
@@ -90,26 +90,26 @@ The following commands are valid:
Restrict the tracing session so that it only triggers for the process
identified by *PID*.
- For example, to monitor all processes executing ``qemu-system-x86_64``
+ For example, to monitor all processes executing |qemu_system|
as found on ``$PATH``, displaying all I/O related probes:
- ::
+ .. parsed-literal::
- $ qemu-trace-stap run qemu-system-x86_64 'qio*'
+ $ qemu-trace-stap run |qemu_system| 'qio*'
To monitor only the QEMU process with PID 1732
- ::
+ .. parsed-literal::
- $ qemu-trace-stap run --pid=1732 qemu-system-x86_64 'qio*'
+ $ qemu-trace-stap run --pid=1732 |qemu_system| 'qio*'
To monitor QEMU processes running an alternative binary outside of
``$PATH``, displaying verbose information about setup of the
tracing environment:
- ::
+ .. parsed-literal::
- $ qemu-trace-stap -v run /opt/qemu/4.0.0/qemu-system-x86_64 'qio*'
+ $ qemu-trace-stap -v run /opt/qemu/|version|/bin/|qemu_system| 'qio*'
See also
--------
diff --git a/docs/tools/virtfs-proxy-helper.rst b/docs/tools/virtfs-proxy-helper.rst
index 6cdeedf8e9..bd310ebb07 100644
--- a/docs/tools/virtfs-proxy-helper.rst
+++ b/docs/tools/virtfs-proxy-helper.rst
@@ -9,6 +9,9 @@ Synopsis
Description
-----------
+NOTE: The 9p 'proxy' backend is deprecated (since QEMU 8.1) and will be
+removed, along with this daemon, in a future version of QEMU!
+
Pass-through security model in QEMU 9p server needs root privilege to do
few file operations (like chown, chmod to any mode/uid:gid). There are two
issues in pass-through security model:
diff --git a/docs/tools/virtiofsd.rst b/docs/tools/virtiofsd.rst
deleted file mode 100644
index b208f2a6f0..0000000000
--- a/docs/tools/virtiofsd.rst
+++ /dev/null
@@ -1,360 +0,0 @@
-QEMU virtio-fs shared file system daemon
-========================================
-
-Synopsis
---------
-
-**virtiofsd** [*OPTIONS*]
-
-Description
------------
-
-Share a host directory tree with a guest through a virtio-fs device. This
-program is a vhost-user backend that implements the virtio-fs device. Each
-virtio-fs device instance requires its own virtiofsd process.
-
-This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
-but should work with any virtual machine monitor (VMM) that supports
-vhost-user. See the Examples section below.
-
-This program must be run as the root user. The program drops privileges where
-possible during startup although it must be able to create and access files
-with any uid/gid:
-
-* The ability to invoke syscalls is limited using seccomp(2).
-* Linux capabilities(7) are dropped.
-
-In "namespace" sandbox mode the program switches into a new file system
-namespace and invokes pivot_root(2) to make the shared directory tree its root.
-A new pid and net namespace is also created to isolate the process.
-
-In "chroot" sandbox mode the program invokes chroot(2) to make the shared
-directory tree its root. This mode is intended for container environments where
-the container runtime has already set up the namespaces and the program does
-not have permission to create namespaces itself.
-
-Both sandbox modes prevent "file system escapes" due to symlinks and other file
-system objects that might lead to files outside the shared directory.
-
-Options
--------
-
-.. program:: virtiofsd
-
-.. option:: -h, --help
-
- Print help.
-
-.. option:: -V, --version
-
- Print version.
-
-.. option:: -d
-
- Enable debug output.
-
-.. option:: --syslog
-
- Print log messages to syslog instead of stderr.
-
-.. option:: -o OPTION
-
- * debug -
- Enable debug output.
-
- * flock|no_flock -
- Enable/disable flock. The default is ``no_flock``.
-
- * modcaps=CAPLIST
- Modify the list of capabilities allowed; CAPLIST is a colon separated
- list of capabilities, each preceded by either + or -, e.g.
- ''+sys_admin:-chown''.
-
- * log_level=LEVEL -
- Print only log messages matching LEVEL or more severe. LEVEL is one of
- ``err``, ``warn``, ``info``, or ``debug``. The default is ``info``.
-
- * posix_lock|no_posix_lock -
- Enable/disable remote POSIX locks. The default is ``no_posix_lock``.
-
- * readdirplus|no_readdirplus -
- Enable/disable readdirplus. The default is ``readdirplus``.
-
- * sandbox=namespace|chroot -
- Sandbox mode:
- - namespace: Create mount, pid, and net namespaces and pivot_root(2) into
- the shared directory.
- - chroot: chroot(2) into shared directory (use in containers).
- The default is "namespace".
-
- * source=PATH -
- Share host directory tree located at PATH. This option is required.
-
- * timeout=TIMEOUT -
- I/O timeout in seconds. The default depends on cache= option.
-
- * writeback|no_writeback -
- Enable/disable writeback cache. The cache allows the FUSE client to buffer
- and merge write requests. The default is ``no_writeback``.
-
- * xattr|no_xattr -
- Enable/disable extended attributes (xattr) on files and directories. The
- default is ``no_xattr``.
-
- * posix_acl|no_posix_acl -
- Enable/disable posix acl support. Posix ACLs are disabled by default.
-
-.. option:: --socket-path=PATH
-
- Listen on vhost-user UNIX domain socket at PATH.
-
-.. option:: --socket-group=GROUP
-
- Set the vhost-user UNIX domain socket gid to GROUP.
-
-.. option:: --fd=FDNUM
-
- Accept connections from vhost-user UNIX domain socket file descriptor FDNUM.
- The file descriptor must already be listening for connections.
-
-.. option:: --thread-pool-size=NUM
-
- Restrict the number of worker threads per request queue to NUM. The default
- is 64.
-
-.. option:: --cache=none|auto|always
-
- Select the desired trade-off between coherency and performance. ``none``
- forbids the FUSE client from caching to achieve best coherency at the cost of
- performance. ``auto`` acts similar to NFS with a 1 second metadata cache
- timeout. ``always`` sets a long cache lifetime at the expense of coherency.
- The default is ``auto``.
-
-Extended attribute (xattr) mapping
-----------------------------------
-
-By default the name of xattr's used by the client are passed through to the server
-file system. This can be a problem where either those xattr names are used
-by something on the server (e.g. selinux client/server confusion) or if the
-virtiofsd is running in a container with restricted privileges where it cannot
-access some attributes.
-
-Mapping syntax
-~~~~~~~~~~~~~~
-
-A mapping of xattr names can be made using -o xattrmap=mapping where the ``mapping``
-string consists of a series of rules.
-
-The first matching rule terminates the mapping.
-The set of rules must include a terminating rule to match any remaining attributes
-at the end.
-
-Each rule consists of a number of fields separated with a separator that is the
-first non-white space character in the rule. This separator must then be used
-for the whole rule.
-White space may be added before and after each rule.
-
-Using ':' as the separator a rule is of the form:
-
-``:type:scope:key:prepend:``
-
-**scope** is:
-
-- 'client' - match 'key' against a xattr name from the client for
- setxattr/getxattr/removexattr
-- 'server' - match 'prepend' against a xattr name from the server
- for listxattr
-- 'all' - can be used to make a single rule where both the server
- and client matches are triggered.
-
-**type** is one of:
-
-- 'prefix' - is designed to prepend and strip a prefix; the modified
- attributes then being passed on to the client/server.
-
-- 'ok' - Causes the rule set to be terminated when a match is found
- while allowing matching xattr's through unchanged.
- It is intended both as a way of explicitly terminating
- the list of rules, and to allow some xattr's to skip following rules.
-
-- 'bad' - If a client tries to use a name matching 'key' it's
- denied using EPERM; when the server passes an attribute
- name matching 'prepend' it's hidden. In many ways its use is very like
- 'ok' as either an explicit terminator or for special handling of certain
- patterns.
-
-**key** is a string tested as a prefix on an attribute name originating
-on the client. It may be empty in which case a 'client' rule
-will always match on client names.
-
-**prepend** is a string tested as a prefix on an attribute name originating
-on the server, and used as a new prefix. It may be empty
-in which case a 'server' rule will always match on all names from
-the server.
-
-e.g.:
-
- ``:prefix:client:trusted.:user.virtiofs.:``
-
- will match 'trusted.' attributes in client calls and prefix them before
- passing them to the server.
-
- ``:prefix:server::user.virtiofs.:``
-
- will strip 'user.virtiofs.' from all server replies.
-
- ``:prefix:all:trusted.:user.virtiofs.:``
-
- combines the previous two cases into a single rule.
-
- ``:ok:client:user.::``
-
- will allow get/set xattr for 'user.' xattr's and ignore
- following rules.
-
- ``:ok:server::security.:``
-
- will pass 'security.' xattr's in listxattr from the server
- and ignore following rules.
-
- ``:ok:all:::``
-
- will terminate the rule search passing any remaining attributes
- in both directions.
-
- ``:bad:server::security.:``
-
- would hide 'security.' xattr's in listxattr from the server.
-
-A simpler 'map' type provides a shorter syntax for the common case:
-
-``:map:key:prepend:``
-
-The 'map' type adds a number of separate rules to add **prepend** as a prefix
-to the matched **key** (or all attributes if **key** is empty).
-There may be at most one 'map' rule and it must be the last rule in the set.
-
-Note: When the 'security.capability' xattr is remapped, the daemon has to do
-extra work to remove it during many operations, which the host kernel normally
-does itself.
-
-Security considerations
-~~~~~~~~~~~~~~~~~~~~~~~
-
-Operating systems typically partition the xattr namespace using
-well defined name prefixes. Each partition may have different
-access controls applied. For example, on Linux there are multiple
-partitions
-
- * ``system.*`` - access varies depending on attribute & filesystem
- * ``security.*`` - only processes with CAP_SYS_ADMIN
- * ``trusted.*`` - only processes with CAP_SYS_ADMIN
- * ``user.*`` - any process granted by file permissions / ownership
-
-While other OS such as FreeBSD have different name prefixes
-and access control rules.
-
-When remapping attributes on the host, it is important to
-ensure that the remapping does not allow a guest user to
-evade the guest access control rules.
-
-Consider if ``trusted.*`` from the guest was remapped to
-``user.virtiofs.trusted*`` in the host. An unprivileged
-user in a Linux guest has the ability to write to xattrs
-under ``user.*``. Thus the user can evade the access
-control restriction on ``trusted.*`` by instead writing
-to ``user.virtiofs.trusted.*``.
-
-As noted above, the partitions used and access controls
-applied, will vary across guest OS, so it is not wise to
-try to predict what the guest OS will use.
-
-The simplest way to avoid an insecure configuration is
-to remap all xattrs at once, to a given fixed prefix.
-This is shown in example (1) below.
-
-If selectively mapping only a subset of xattr prefixes,
-then rules must be added to explicitly block direct
-access to the target of the remapping. This is shown
-in example (2) below.
-
-Mapping examples
-~~~~~~~~~~~~~~~~
-
-1) Prefix all attributes with 'user.virtiofs.'
-
-::
-
- -o xattrmap=":prefix:all::user.virtiofs.::bad:all:::"
-
-
-This uses two rules, using : as the field separator;
-the first rule prefixes and strips 'user.virtiofs.',
-the second rule hides any non-prefixed attributes that
-the host set.
-
-This is equivalent to the 'map' rule:
-
-::
-
- -o xattrmap=":map::user.virtiofs.:"
-
-2) Prefix 'trusted.' attributes, allow others through
-
-::
-
- "/prefix/all/trusted./user.virtiofs./
- /bad/server//trusted./
- /bad/client/user.virtiofs.//
- /ok/all///"
-
-
-Here there are four rules, using / as the field
-separator, and also demonstrating that new lines can
-be included between rules.
-The first rule is the prefixing of 'trusted.' and
-stripping of 'user.virtiofs.'.
-The second rule hides unprefixed 'trusted.' attributes
-on the host.
-The third rule stops a guest from explicitly setting
-the 'user.virtiofs.' path directly to prevent access
-control bypass on the target of the earlier prefix
-remapping.
-Finally, the fourth rule lets all remaining attributes
-through.
-
-This is equivalent to the 'map' rule:
-
-::
-
- -o xattrmap="/map/trusted./user.virtiofs./"
-
-3) Hide 'security.' attributes, and allow everything else
-
-::
-
- "/bad/all/security./security./
- /ok/all///"
-
-The first rule combines what could be separate client and server
-rules into a single 'all' rule, matching 'security.' in either
-client arguments or lists returned from the host. This stops
-the client seeing any 'security.' attributes on the server and
-stops it setting any.
-
-Examples
---------
-
-Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket
-``/var/run/vm001-vhost-fs.sock``:
-
-.. parsed-literal::
-
- host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001
- host# |qemu_system| \\
- -chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \\
- -device vhost-user-fs-pci,chardev=char0,tag=myfs \\
- -object memory-backend-memfd,id=mem,size=4G,share=on \\
- -numa node,memdev=mem \\
- ...
- guest# mount -t virtiofs myfs /mnt
diff --git a/docs/u2f.txt b/docs/u2f.txt
deleted file mode 100644
index 8f44994818..0000000000
--- a/docs/u2f.txt
+++ /dev/null
@@ -1,110 +0,0 @@
-QEMU U2F Key Device Documentation.
-
-Contents
-1. USB U2F key device
-2. Building
-3. Using u2f-emulated
-4. Using u2f-passthru
-5. Libu2f-emu
-
-1. USB U2F key device
-
-U2F is an open authentication standard that enables relying parties
-exposed to the internet to offer a strong second factor option for end
-user authentication.
-
-The standard brings many advantages to both parties, client and server,
-allowing to reduce over-reliance on passwords, it increases authentication
-security and simplifies passwords.
-
-The second factor is materialized by a device implementing the U2F
-protocol. In case of a USB U2F security key, it is a USB HID device
-that implements the U2F protocol.
-
-In Qemu, the USB U2F key device offers a dedicated support of U2F, allowing
-guest USB FIDO/U2F security keys operating in two possible modes:
-pass-through and emulated.
-
-The pass-through mode consists of passing all requests made from the guest
-to the physical security key connected to the host machine and vice versa.
-In addition, the dedicated pass-through allows to have a U2F security key
-shared on several guests which is not possible with a simple host device
-assignment pass-through.
-
-The emulated mode consists of completely emulating the behavior of an
-U2F device through software part. Libu2f-emu is used for that.
-
-
-2. Building
-
-To ensure the build of the u2f-emulated device variant which depends
-on libu2f-emu: configuring and building:
-
- ./configure --enable-u2f && make
-
-The pass-through mode is built by default on Linux. To take advantage
-of the autoscan option it provides, make sure you have a working libudev
-installed on the host.
-
-
-3. Using u2f-emulated
-
-To work, an emulated U2F device must have four elements:
- * ec x509 certificate
- * ec private key
- * counter (four bytes value)
- * 48 bytes of entropy (random bits)
-
-To use this type of device, this one has to be configured, and these
-four elements must be passed one way or another.
-
-Assuming that you have a working libu2f-emu installed on the host.
-There are three possible ways of configurations:
- * ephemeral
- * setup directory
- * manual
-
-Ephemeral is the simplest way to configure, it lets the device generate
-all the elements it needs for a single use of the lifetime of the device.
-
- qemu -usb -device u2f-emulated
-
-Setup directory allows to configure the device from a directory containing
-four files:
- * certificate.pem: ec x509 certificate
- * private-key.pem: ec private key
- * counter: counter value
- * entropy: 48 bytes of entropy
-
- qemu -usb -device u2f-emulated,dir=$dir
-
-Manual allows to configure the device more finely by specifying each
-of the elements necessary for the device:
- * cert
- * priv
- * counter
- * entropy
-
- qemu -usb -device u2f-emulated,cert=$DIR1/$FILE1,priv=$DIR2/$FILE2,counter=$DIR3/$FILE3,entropy=$DIR4/$FILE4
-
-
-4. Using u2f-passthru
-
-On the host specify the u2f-passthru device with a suitable hidraw:
-
- qemu -usb -device u2f-passthru,hidraw=/dev/hidraw0
-
-Alternately, the u2f-passthru device can autoscan to take the first
-U2F device it finds on the host (this requires a working libudev):
-
- qemu -usb -device u2f-passthru
-
-
-5. Libu2f-emu
-
-The u2f-emulated device uses libu2f-emu for the U2F key emulation. Libu2f-emu
-implements completely the U2F protocol device part for all specified
-transport given by the FIDO Alliance.
-
-For more information about libu2f-emu see this page:
-https://github.com/MattGorko/libu2f-emu.
diff --git a/docs/user/index.rst b/docs/user/index.rst
index 2c4e29f3db..782d27cda2 100644
--- a/docs/user/index.rst
+++ b/docs/user/index.rst
@@ -1,3 +1,5 @@
+.. _User Mode Emulation:
+
-------------------
User Mode Emulation
-------------------
diff --git a/docs/user/main.rst b/docs/user/main.rst
index e08d4be63b..e04bc2cb86 100644
--- a/docs/user/main.rst
+++ b/docs/user/main.rst
@@ -87,14 +87,13 @@ Debug options:
Activate logging of the specified items (use '-d help' for a list of
log items)
-``-p pagesize``
- Act as if the host page size was 'pagesize' bytes
-
``-g port``
Wait for a gdb connection on port
-``-singlestep``
- Run the emulation in single step mode.
+``-one-insn-per-tb``
+ Run the emulation with one guest instruction per translation block.
+ This slows down emulation a lot, but can be useful in some situations,
+ such as when trying to analyse the logs produced by the ``-d`` option.
Environment variables:
@@ -160,13 +159,8 @@ Other binaries
* ``qemu-mipsn32el`` executes 32-bit little endian MIPS binaries (MIPS N32
ABI).
-- user mode (NiosII)
-
- * ``qemu-nios2`` TODO.
-
- user mode (PowerPC)
- * ``qemu-ppc64abi32`` TODO.
* ``qemu-ppc64`` TODO.
* ``qemu-ppc`` TODO.
@@ -243,5 +237,7 @@ Debug options:
``-p pagesize``
Act as if the host page size was 'pagesize' bytes
-``-singlestep``
- Run the emulation in single step mode.
+``-one-insn-per-tb``
+ Run the emulation with one guest instruction per translation block.
+ This slows down emulation a lot, but can be useful in some situations,
+ such as when trying to analyse the logs produced by the ``-d`` option.