aboutsummaryrefslogtreecommitdiff
path: root/docs/pvrdma.txt
diff options
context:
space:
mode:
Diffstat (limited to 'docs/pvrdma.txt')
-rw-r--r--docs/pvrdma.txt345
1 files changed, 0 insertions, 345 deletions
diff --git a/docs/pvrdma.txt b/docs/pvrdma.txt
deleted file mode 100644
index 5c122fe818..0000000000
--- a/docs/pvrdma.txt
+++ /dev/null
@@ -1,345 +0,0 @@
-Paravirtualized RDMA Device (PVRDMA)
-====================================
-
-
-1. Description
-===============
-PVRDMA is the QEMU implementation of VMware's paravirtualized RDMA device.
-It works with its Linux Kernel driver AS IS, no need for any special guest
-modifications.
-
-While it complies with the VMware device, it can also communicate with bare
-metal RDMA-enabled machines as peers.
-
-It does not require an RDMA HCA in the host, it can work with Soft-RoCE (rxe).
-
-It does not require the whole guest RAM to be pinned allowing memory
-over-commit and, even if not implemented yet, migration support will be
-possible with some HW assistance.
-
-A project presentation accompany this document:
-- https://blog.linuxplumbersconf.org/2017/ocw/system/presentations/4730/original/lpc-2017-pvrdma-marcel-apfelbaum-yuval-shaia.pdf
-
-
-
-2. Setup
-========
-
-
-2.1 Guest setup
-===============
-Fedora 27+ kernels work out of the box, older distributions
-require updating the kernel to 4.14 to include the pvrdma driver.
-
-However the libpvrdma library needed by User Level Software is still
-not available as part of the distributions, so the rdma-core library
-needs to be compiled and optionally installed.
-
-Please follow the instructions at:
- https://github.com/linux-rdma/rdma-core.git
-
-
-2.2 Host Setup
-==============
-The pvrdma backend is an ibdevice interface that can be exposed
-either by a Soft-RoCE(rxe) device on machines with no RDMA device,
-or an HCA SRIOV function(VF/PF).
-Note that ibdevice interfaces can't be shared between pvrdma devices,
-each one requiring a separate instance (rxe or SRIOV VF).
-
-
-2.2.1 Soft-RoCE backend(rxe)
-===========================
-A stable version of rxe is required, Fedora 27+ or a Linux
-Kernel 4.14+ is preferred.
-
-The rdma_rxe module is part of the Linux Kernel but not loaded by default.
-Install the User Level library (librxe) following the instructions from:
-https://github.com/SoftRoCE/rxe-dev/wiki/rxe-dev:-Home
-
-Associate an ETH interface with rxe by running:
- rxe_cfg add eth0
-An rxe0 ibdevice interface will be created and can be used as pvrdma backend.
-
-
-2.2.2 RDMA device Virtual Function backend
-==========================================
-Nothing special is required, the pvrdma device can work not only with
-Ethernet Links, but also Infinibands Links.
-All is needed is an ibdevice with an active port, for Mellanox cards
-will be something like mlx5_6 which can be the backend.
-
-
-2.2.3 QEMU setup
-================
-Configure QEMU with --enable-rdma flag, installing
-the required RDMA libraries.
-
-
-
-3. Usage
-========
-
-
-3.1 VM Memory settings
-======================
-Currently the device is working only with memory backed RAM
-and it must be mark as "shared":
- -m 1G \
- -object memory-backend-ram,id=mb1,size=1G,share \
- -numa node,memdev=mb1 \
-
-
-3.2 MAD Multiplexer
-===================
-MAD Multiplexer is a service that exposes MAD-like interface for VMs in
-order to overcome the limitation where only single entity can register with
-MAD layer to send and receive RDMA-CM MAD packets.
-
-To build rdmacm-mux run
-# make rdmacm-mux
-
-Before running the rdmacm-mux make sure that both ib_cm and rdma_cm kernel
-modules aren't loaded, otherwise the rdmacm-mux service will fail to start.
-
-The application accepts 3 command line arguments and exposes a UNIX socket
-to pass control and data to it.
--d rdma-device-name Name of RDMA device to register with
--s unix-socket-path Path to unix socket to listen (default /var/run/rdmacm-mux)
--p rdma-device-port Port number of RDMA device to register with (default 1)
-The final UNIX socket file name is a concatenation of the 3 arguments so
-for example for device mlx5_0 on port 2 this /var/run/rdmacm-mux-mlx5_0-2
-will be created.
-
-pvrdma requires this service.
-
-Please refer to contrib/rdmacm-mux for more details.
-
-
-3.3 Service exposed by libvirt daemon
-=====================================
-The control over the RDMA device's GID table is done by updating the
-device's Ethernet function addresses.
-Usually the first GID entry is determined by the MAC address, the second by
-the first IPv6 address and the third by the IPv4 address. Other entries can
-be added by adding more IP addresses. The opposite is the same, i.e.
-whenever an address is removed, the corresponding GID entry is removed.
-The process is done by the network and RDMA stacks. Whenever an address is
-added the ib_core driver is notified and calls the device driver add_gid
-function which in turn update the device.
-To support this in pvrdma device the device hooks into the create_bind and
-destroy_bind HW commands triggered by pvrdma driver in guest.
-
-Whenever changed is made to the pvrdma port's GID table a special QMP
-messages is sent to be processed by libvirt to update the address of the
-backend Ethernet device.
-
-pvrdma requires that libvirt service will be up.
-
-
-3.4 PCI devices settings
-========================
-RoCE device exposes two functions - an Ethernet and RDMA.
-To support it, pvrdma device is composed of two PCI functions, an Ethernet
-device of type vmxnet3 on PCI slot 0 and a PVRDMA device on PCI slot 1. The
-Ethernet function can be used for other Ethernet purposes such as IP.
-
-
-3.5 Device parameters
-=====================
-- netdev: Specifies the Ethernet device function name on the host for
- example enp175s0f0. For Soft-RoCE device (rxe) this would be the Ethernet
- device used to create it.
-- ibdev: The IB device name on host for example rxe0, mlx5_0 etc.
-- mad-chardev: The name of the MAD multiplexer char device.
-- ibport: In case of multi-port device (such as Mellanox's HCA) this
- specify the port to use. If not set 1 will be used.
-- dev-caps-max-mr-size: The maximum size of MR.
-- dev-caps-max-qp: Maximum number of QPs.
-- dev-caps-max-cq: Maximum number of CQs.
-- dev-caps-max-mr: Maximum number of MRs.
-- dev-caps-max-pd: Maximum number of PDs.
-- dev-caps-max-ah: Maximum number of AHs.
-
-Notes:
-- The first 3 parameters are mandatory settings, the rest have their
- defaults.
-- The last 8 parameters (the ones that prefixed by dev-caps) defines the top
- limits but the final values is adjusted by the backend device limitations.
-- netdev can be extracted from ibdev's sysfs
- (/sys/class/infiniband/<ibdev>/device/net/)
-
-
-3.6 Example
-===========
-Define bridge device with vmxnet3 network backend:
-<interface type='bridge'>
- <mac address='56:b4:44:e9:62:dc'/>
- <source bridge='bridge1'/>
- <model type='vmxnet3'/>
- <address type='pci' domain='0x0000' bus='0x00' slot='0x10' function='0x0' multifunction='on'/>
-</interface>
-
-Define pvrdma device:
-<qemu:commandline>
- <qemu:arg value='-object'/>
- <qemu:arg value='memory-backend-ram,id=mb1,size=1G,share'/>
- <qemu:arg value='-numa'/>
- <qemu:arg value='node,memdev=mb1'/>
- <qemu:arg value='-chardev'/>
- <qemu:arg value='socket,path=/var/run/rdmacm-mux-rxe0-1,id=mads'/>
- <qemu:arg value='-device'/>
- <qemu:arg value='pvrdma,addr=10.1,ibdev=rxe0,netdev=bridge0,mad-chardev=mads'/>
-</qemu:commandline>
-
-
-
-4. Implementation details
-=========================
-
-
-4.1 Overview
-============
-The device acts like a proxy between the Guest Driver and the host
-ibdevice interface.
-On configuration path:
- - For every hardware resource request (PD/QP/CQ/...) the pvrdma will request
- a resource from the backend interface, maintaining a 1-1 mapping
- between the guest and host.
-On data path:
- - Every post_send/receive received from the guest will be converted into
- a post_send/receive for the backend. The buffers data will not be touched
- or copied resulting in near bare-metal performance for large enough buffers.
- - Completions from the backend interface will result in completions for
- the pvrdma device.
-
-
-4.2 PCI BARs
-============
-PCI Bars:
- BAR 0 - MSI-X
- MSI-X vectors:
- (0) Command - used when execution of a command is completed.
- (1) Async - not in use.
- (2) Completion - used when a completion event is placed in
- device's CQ ring.
- BAR 1 - Registers
- --------------------------------------------------------
- | VERSION | DSR | CTL | REQ | ERR | ICR | IMR | MAC |
- --------------------------------------------------------
- DSR - Address of driver/device shared memory used
- for the command channel, used for passing:
- - General info such as driver version
- - Address of 'command' and 'response'
- - Address of async ring
- - Address of device's CQ ring
- - Device capabilities
- CTL - Device control operations (activate, reset etc)
- IMG - Set interrupt mask
- REQ - Command execution register
- ERR - Operation status
-
- BAR 2 - UAR
- ---------------------------------------------------------
- | QP_NUM | SEND/RECV Flag || CQ_NUM | ARM/POLL Flag |
- ---------------------------------------------------------
- - Offset 0 used for QP operations (send and recv)
- - Offset 4 used for CQ operations (arm and poll)
-
-
-4.3 Major flows
-===============
-
-4.3.1 Create CQ
-===============
- - Guest driver
- - Allocates pages for CQ ring
- - Creates page directory (pdir) to hold CQ ring's pages
- - Initializes CQ ring
- - Initializes 'Create CQ' command object (cqe, pdir etc)
- - Copies the command to 'command' address
- - Writes 0 into REQ register
- - Device
- - Reads the request object from the 'command' address
- - Allocates CQ object and initialize CQ ring based on pdir
- - Creates the backend CQ
- - Writes operation status to ERR register
- - Posts command-interrupt to guest
- - Guest driver
- - Reads the HW response code from ERR register
-
-4.3.2 Create QP
-===============
- - Guest driver
- - Allocates pages for send and receive rings
- - Creates page directory(pdir) to hold the ring's pages
- - Initializes 'Create QP' command object (max_send_wr,
- send_cq_handle, recv_cq_handle, pdir etc)
- - Copies the object to 'command' address
- - Write 0 into REQ register
- - Device
- - Reads the request object from 'command' address
- - Allocates the QP object and initialize
- - Send and recv rings based on pdir
- - Send and recv ring state
- - Creates the backend QP
- - Writes the operation status to ERR register
- - Posts command-interrupt to guest
- - Guest driver
- - Reads the HW response code from ERR register
-
-4.3.3 Post receive
-==================
- - Guest driver
- - Initializes a wqe and place it on recv ring
- - Write to qpn|qp_recv_bit (31) to QP offset in UAR
- - Device
- - Extracts qpn from UAR
- - Walks through the ring and does the following for each wqe
- - Prepares the backend CQE context to be used when
- receiving completion from backend (wr_id, op_code, emu_cq_num)
- - For each sge prepares backend sge
- - Calls backend's post_recv
-
-4.3.4 Process backend events
-============================
- - Done by a dedicated thread used to process backend events;
- at initialization is attached to the device and creates
- the communication channel.
- - Thread main loop:
- - Polls for completions
- - Extracts QEMU _cq_num, wr_id and op_code from context
- - Writes CQE to CQ ring
- - Writes CQ number to device CQ
- - Sends completion-interrupt to guest
- - Deallocates context
- - Acks the event to backend
-
-
-
-5. Limitations
-==============
-- The device obviously is limited by the Guest Linux Driver features implementation
- of the VMware device API.
-- Memory registration mechanism requires mremap for every page in the buffer in order
- to map it to a contiguous virtual address range. Since this is not the data path
- it should not matter much. If the default max mr size is increased, be aware that
- memory registration can take up to 0.5 seconds for 1GB of memory.
-- The device requires target page size to be the same as the host page size,
- otherwise it will fail to init.
-- QEMU cannot map guest RAM from a file descriptor if a pvrdma device is attached,
- so it can't work with huge pages. The limitation will be addressed in the future,
- however QEMU allocates Guest RAM with MADV_HUGEPAGE so if there are enough huge
- pages available, QEMU will use them. QEMU will fail to init if the requirements
- are not met.
-
-
-
-6. Performance
-==============
-By design the pvrdma device exits on each post-send/receive, so for small buffers
-the performance is affected; however for medium buffers it will became close to
-bare metal and from 1MB buffers and up it reaches bare metal performance.
-(tested with 2 VMs, the pvrdma devices connected to 2 VFs of the same device)
-
-All the above assumes no memory registration is done on data path.