path: root/Documentation/infiniband
diff options
Diffstat (limited to 'Documentation/infiniband')
5 files changed, 502 insertions, 0 deletions
diff --git a/Documentation/infiniband/core_locking.txt b/Documentation/infiniband/core_locking.txt
new file mode 100644
index 00000000..e1678542
--- /dev/null
+++ b/Documentation/infiniband/core_locking.txt
@@ -0,0 +1,114 @@
+ This guide is an attempt to make explicit the locking assumptions
+ made by the InfiniBand midlayer. It describes the requirements on
+ both low-level drivers that sit below the midlayer and upper level
+ protocols that use the midlayer.
+Sleeping and interrupt context
+ With the following exceptions, a low-level driver implementation of
+ all of the methods in struct ib_device may sleep. The exceptions
+ are any methods from the list:
+ create_ah
+ modify_ah
+ query_ah
+ destroy_ah
+ bind_mw
+ post_send
+ post_recv
+ poll_cq
+ req_notify_cq
+ map_phys_fmr
+ which may not sleep and must be callable from any context.
+ The corresponding functions exported to upper level protocol
+ consumers:
+ ib_create_ah
+ ib_modify_ah
+ ib_query_ah
+ ib_destroy_ah
+ ib_bind_mw
+ ib_post_send
+ ib_post_recv
+ ib_req_notify_cq
+ ib_map_phys_fmr
+ are therefore safe to call from any context.
+ In addition, the function
+ ib_dispatch_event
+ used by low-level drivers to dispatch asynchronous events through
+ the midlayer is also safe to call from any context.
+ All of the methods in struct ib_device exported by a low-level
+ driver must be fully reentrant. The low-level driver is required to
+ perform all synchronization necessary to maintain consistency, even
+ if multiple function calls using the same object are run
+ simultaneously.
+ The IB midlayer does not perform any serialization of function calls.
+ Because low-level drivers are reentrant, upper level protocol
+ consumers are not required to perform any serialization. However,
+ some serialization may be required to get sensible results. For
+ example, a consumer may safely call ib_poll_cq() on multiple CPUs
+ simultaneously. However, the ordering of the work completion
+ information between different calls of ib_poll_cq() is not defined.
+ A low-level driver must not perform a callback directly from the
+ same callchain as an ib_device method call. For example, it is not
+ allowed for a low-level driver to call a consumer's completion event
+ handler directly from its post_send method. Instead, the low-level
+ driver should defer this callback by, for example, scheduling a
+ tasklet to perform the callback.
+ The low-level driver is responsible for ensuring that multiple
+ completion event handlers for the same CQ are not called
+ simultaneously. The driver must guarantee that only one CQ event
+ handler for a given CQ is running at a time. In other words, the
+ following situation is not allowed:
+ low-level driver ->
+ consumer CQ event callback:
+ /* ... */
+ ib_req_notify_cq(cq, ...);
+ low-level driver ->
+ /* ... */ consumer CQ event callback:
+ /* ... */
+ return from CQ event handler
+ The context in which completion event and asynchronous event
+ callbacks run is not defined. Depending on the low-level driver, it
+ may be process context, softirq context, or interrupt context.
+ Upper level protocol consumers may not sleep in a callback.
+ A low-level driver announces that a device is ready for use by
+ consumers when it calls ib_register_device(), all initialization
+ must be complete before this call. The device must remain usable
+ until the driver's call to ib_unregister_device() has returned.
+ A low-level driver must call ib_register_device() and
+ ib_unregister_device() from process context. It must not hold any
+ semaphores that could cause deadlock if a consumer calls back into
+ the driver across these calls.
+ An upper level protocol consumer may begin using an IB device as
+ soon as the add method of its struct ib_client is called for that
+ device. A consumer must finish all cleanup and free all resources
+ relating to a device before returning from the remove method.
+ A consumer is permitted to sleep in its add and remove methods.
diff --git a/Documentation/infiniband/ipoib.txt b/Documentation/infiniband/ipoib.txt
new file mode 100644
index 00000000..f2cfe265
--- /dev/null
+++ b/Documentation/infiniband/ipoib.txt
@@ -0,0 +1,105 @@
+ The ib_ipoib driver is an implementation of the IP over InfiniBand
+ protocol as specified by RFC 4391 and 4392, issued by the IETF ipoib
+ working group. It is a "native" implementation in the sense of
+ setting the interface type to ARPHRD_INFINIBAND and the hardware
+ address length to 20 (earlier proprietary implementations
+ masqueraded to the kernel as ethernet interfaces).
+Partitions and P_Keys
+ When the IPoIB driver is loaded, it creates one interface for each
+ port using the P_Key at index 0. To create an interface with a
+ different P_Key, write the desired P_Key into the main interface's
+ /sys/class/net/<intf name>/create_child file. For example:
+ echo 0x8001 > /sys/class/net/ib0/create_child
+ This will create an interface named ib0.8001 with P_Key 0x8001. To
+ remove a subinterface, use the "delete_child" file:
+ echo 0x8001 > /sys/class/net/ib0/delete_child
+ The P_Key for any interface is given by the "pkey" file, and the
+ main interface for a subinterface is in "parent."
+ Child interface create/delete can also be done using IPoIB's
+ rtnl_link_ops, where childs created using either way behave the same.
+Datagram vs Connected modes
+ The IPoIB driver supports two modes of operation: datagram and
+ connected. The mode is set and read through an interface's
+ /sys/class/net/<intf name>/mode file.
+ In datagram mode, the IB UD (Unreliable Datagram) transport is used
+ and so the interface MTU has is equal to the IB L2 MTU minus the
+ IPoIB encapsulation header (4 bytes). For example, in a typical IB
+ fabric with a 2K MTU, the IPoIB MTU will be 2048 - 4 = 2044 bytes.
+ In connected mode, the IB RC (Reliable Connected) transport is used.
+ Connected mode takes advantage of the connected nature of the IB
+ transport and allows an MTU up to the maximal IP packet size of 64K,
+ which reduces the number of IP packets needed for handling large UDP
+ datagrams, TCP segments, etc and increases the performance for large
+ messages.
+ In connected mode, the interface's UD QP is still used for multicast
+ and communication with peers that don't support connected mode. In
+ this case, RX emulation of ICMP PMTU packets is used to cause the
+ networking stack to use the smaller UD MTU for these neighbours.
+Stateless offloads
+ If the IB HW supports IPoIB stateless offloads, IPoIB advertises
+ TCP/IP checksum and/or Large Send (LSO) offloading capability to the
+ network stack.
+ Large Receive (LRO) offloading is also implemented and may be turned
+ on/off using ethtool calls. Currently LRO is supported only for
+ checksum offload capable devices.
+ Stateless offloads are supported only in datagram mode.
+Interrupt moderation
+ If the underlying IB device supports CQ event moderation, one can
+ use ethtool to set interrupt mitigation parameters and thus reduce
+ the overhead incurred by handling interrupts. The main code path of
+ IPoIB doesn't use events for TX completion signaling so only RX
+ moderation is supported.
+Debugging Information
+ By compiling the IPoIB driver with CONFIG_INFINIBAND_IPOIB_DEBUG set
+ to 'y', tracing messages are compiled into the driver. They are
+ turned on by setting the module parameters debug_level and
+ mcast_debug_level to 1. These parameters can be controlled at
+ runtime through files in /sys/module/ib_ipoib/.
+ CONFIG_INFINIBAND_IPOIB_DEBUG also enables files in the debugfs
+ virtual filesystem. By mounting this filesystem, for example with
+ mount -t debugfs none /sys/kernel/debug
+ it is possible to get statistics about multicast groups from the
+ files /sys/kernel/debug/ipoib/ib0_mcg and so on.
+ The performance impact of this option is negligible, so it
+ is safe to enable this option with debug_level set to 0 for normal
+ operation.
+ CONFIG_INFINIBAND_IPOIB_DEBUG_DATA enables even more debug output in
+ the data path when data_debug_level is set to 1. However, even with
+ the output disabled, enabling this configuration option will affect
+ performance, because it adds tests to the fast path.
+ Transmission of IP over InfiniBand (IPoIB) (RFC 4391)
+ http://ietf.org/rfc/rfc4391.txt
+ IP over InfiniBand (IPoIB) Architecture (RFC 4392)
+ http://ietf.org/rfc/rfc4392.txt
+ IP over InfiniBand: Connected Mode (RFC 4755)
+ http://ietf.org/rfc/rfc4755.txt
diff --git a/Documentation/infiniband/sysfs.txt b/Documentation/infiniband/sysfs.txt
new file mode 100644
index 00000000..ddd519b7
--- /dev/null
+++ b/Documentation/infiniband/sysfs.txt
@@ -0,0 +1,66 @@
+ For each InfiniBand device, the InfiniBand drivers create the
+ following files under /sys/class/infiniband/<device name>:
+ node_type - Node type (CA, switch or router)
+ node_guid - Node GUID
+ sys_image_guid - System image GUID
+ In addition, there is a "ports" subdirectory, with one subdirectory
+ for each port. For example, if mthca0 is a 2-port HCA, there will
+ be two directories:
+ /sys/class/infiniband/mthca0/ports/1
+ /sys/class/infiniband/mthca0/ports/2
+ (A switch will only have a single "0" subdirectory for switch port
+ 0; no subdirectory is created for normal switch ports)
+ In each port subdirectory, the following files are created:
+ cap_mask - Port capability mask
+ lid - Port LID
+ lid_mask_count - Port LID mask count
+ rate - Port data rate (active width * active speed)
+ sm_lid - Subnet manager LID for port's subnet
+ sm_sl - Subnet manager SL for port's subnet
+ state - Port state (DOWN, INIT, ARMED, ACTIVE or ACTIVE_DEFER)
+ phys_state - Port physical state (Sleep, Polling, LinkUp, etc)
+ There is also a "counters" subdirectory, with files
+ VL15_dropped
+ excessive_buffer_overrun_errors
+ link_downed
+ link_error_recovery
+ local_link_integrity_errors
+ port_rcv_constraint_errors
+ port_rcv_data
+ port_rcv_errors
+ port_rcv_packets
+ port_rcv_remote_physical_errors
+ port_rcv_switch_relay_errors
+ port_xmit_constraint_errors
+ port_xmit_data
+ port_xmit_discards
+ port_xmit_packets
+ symbol_error
+ Each of these files contains the corresponding value from the port's
+ Performance Management PortCounters attribute, as described in
+ section of the InfiniBand Architecture Specification.
+ The "pkeys" and "gids" subdirectories contain one file for each
+ entry in the port's P_Key or GID table respectively. For example,
+ ports/1/pkeys/10 contains the value at index 10 in port 1's P_Key
+ table.
+ The Mellanox HCA driver also creates the files:
+ hw_rev - Hardware revision number
+ fw_ver - Firmware version
+ hca_type - HCA type: "MT23108", "MT25208 (MT23108 compat mode)",
+ or "MT25208"
diff --git a/Documentation/infiniband/user_mad.txt b/Documentation/infiniband/user_mad.txt
new file mode 100644
index 00000000..8a366959
--- /dev/null
+++ b/Documentation/infiniband/user_mad.txt
@@ -0,0 +1,148 @@
+Device files
+ Each port of each InfiniBand device has a "umad" device and an
+ "issm" device attached. For example, a two-port HCA will have two
+ umad devices and two issm devices, while a switch will have one
+ device of each type (for switch port 0).
+Creating MAD agents
+ A MAD agent can be created by filling in a struct ib_user_mad_reg_req
+ and then calling the IB_USER_MAD_REGISTER_AGENT ioctl on a file
+ descriptor for the appropriate device file. If the registration
+ request succeeds, a 32-bit id will be returned in the structure.
+ For example:
+ struct ib_user_mad_reg_req req = { /* ... */ };
+ ret = ioctl(fd, IB_USER_MAD_REGISTER_AGENT, (char *) &req);
+ if (!ret)
+ my_agent = req.id;
+ else
+ perror("agent register");
+ Agents can be unregistered with the IB_USER_MAD_UNREGISTER_AGENT
+ ioctl. Also, all agents registered through a file descriptor will
+ be unregistered when the descriptor is closed.
+Receiving MADs
+ MADs are received using read(). The receive side now supports
+ RMPP. The buffer passed to read() must be at least one
+ struct ib_user_mad + 256 bytes. For example:
+ If the buffer passed is not large enough to hold the received
+ MAD (RMPP), the errno is set to ENOSPC and the length of the
+ buffer needed is set in mad.length.
+ Example for normal MAD (non RMPP) reads:
+ struct ib_user_mad *mad;
+ mad = malloc(sizeof *mad + 256);
+ ret = read(fd, mad, sizeof *mad + 256);
+ if (ret != sizeof mad + 256) {
+ perror("read");
+ free(mad);
+ }
+ Example for RMPP reads:
+ struct ib_user_mad *mad;
+ mad = malloc(sizeof *mad + 256);
+ ret = read(fd, mad, sizeof *mad + 256);
+ if (ret == -ENOSPC)) {
+ length = mad.length;
+ free(mad);
+ mad = malloc(sizeof *mad + length);
+ ret = read(fd, mad, sizeof *mad + length);
+ }
+ if (ret < 0) {
+ perror("read");
+ free(mad);
+ }
+ In addition to the actual MAD contents, the other struct ib_user_mad
+ fields will be filled in with information on the received MAD. For
+ example, the remote LID will be in mad.lid.
+ If a send times out, a receive will be generated with mad.status set
+ to ETIMEDOUT. Otherwise when a MAD has been successfully received,
+ mad.status will be 0.
+ poll()/select() may be used to wait until a MAD can be read.
+Sending MADs
+ MADs are sent using write(). The agent ID for sending should be
+ filled into the id field of the MAD, the destination LID should be
+ filled into the lid field, and so on. The send side does support
+ RMPP so arbitrary length MAD can be sent. For example:
+ struct ib_user_mad *mad;
+ mad = malloc(sizeof *mad + mad_length);
+ /* fill in mad->data */
+ mad->hdr.id = my_agent; /* req.id from agent registration */
+ mad->hdr.lid = my_dest; /* in network byte order... */
+ /* etc. */
+ ret = write(fd, &mad, sizeof *mad + mad_length);
+ if (ret != sizeof *mad + mad_length)
+ perror("write");
+Transaction IDs
+ Users of the umad devices can use the lower 32 bits of the
+ transaction ID field (that is, the least significant half of the
+ field in network byte order) in MADs being sent to match
+ request/response pairs. The upper 32 bits are reserved for use by
+ the kernel and will be overwritten before a MAD is sent.
+P_Key Index Handling
+ The old ib_umad interface did not allow setting the P_Key index for
+ MADs that are sent and did not provide a way for obtaining the P_Key
+ index of received MADs. A new layout for struct ib_user_mad_hdr
+ with a pkey_index member has been defined; however, to preserve
+ binary compatibility with older applications, this new layout will
+ not be used unless the IB_USER_MAD_ENABLE_PKEY ioctl is called
+ before a file descriptor is used for anything else.
+ In September 2008, the IB_USER_MAD_ABI_VERSION will be incremented
+ to 6, the new layout of struct ib_user_mad_hdr will be used by
+ default, and the IB_USER_MAD_ENABLE_PKEY ioctl will be removed.
+Setting IsSM Capability Bit
+ To set the IsSM capability bit for a port, simply open the
+ corresponding issm device file. If the IsSM bit is already set,
+ then the open call will block until the bit is cleared (or return
+ immediately with errno set to EAGAIN if the O_NONBLOCK flag is
+ passed to open()). The IsSM bit will be cleared when the issm file
+ is closed. No read, write or other operations can be performed on
+ the issm file.
+/dev files
+ To create the appropriate character device files automatically with
+ udev, a rule like
+ KERNEL=="umad*", NAME="infiniband/%k"
+ KERNEL=="issm*", NAME="infiniband/%k"
+ can be used. This will create device nodes named
+ /dev/infiniband/umad0
+ /dev/infiniband/issm0
+ for the first port, and so on. The InfiniBand device and port
+ associated with these devices can be determined from the files
+ /sys/class/infiniband_mad/umad0/ibdev
+ /sys/class/infiniband_mad/umad0/port
+ and
+ /sys/class/infiniband_mad/issm0/ibdev
+ /sys/class/infiniband_mad/issm0/port
diff --git a/Documentation/infiniband/user_verbs.txt b/Documentation/infiniband/user_verbs.txt
new file mode 100644
index 00000000..e5092d69
--- /dev/null
+++ b/Documentation/infiniband/user_verbs.txt
@@ -0,0 +1,69 @@
+ The ib_uverbs module, built by enabling CONFIG_INFINIBAND_USER_VERBS,
+ enables direct userspace access to IB hardware via "verbs," as
+ described in chapter 11 of the InfiniBand Architecture Specification.
+ To use the verbs, the libibverbs library, available from
+ http://www.openfabrics.org/, is required. libibverbs contains a
+ device-independent API for using the ib_uverbs interface.
+ libibverbs also requires appropriate device-dependent kernel and
+ userspace driver for your InfiniBand hardware. For example, to use
+ a Mellanox HCA, you will need the ib_mthca kernel module and the
+ libmthca userspace driver be installed.
+User-kernel communication
+ Userspace communicates with the kernel for slow path, resource
+ management operations via the /dev/infiniband/uverbsN character
+ devices. Fast path operations are typically performed by writing
+ directly to hardware registers mmap()ed into userspace, with no
+ system call or context switch into the kernel.
+ Commands are sent to the kernel via write()s on these device files.
+ The ABI is defined in drivers/infiniband/include/ib_user_verbs.h.
+ The structs for commands that require a response from the kernel
+ contain a 64-bit field used to pass a pointer to an output buffer.
+ Status is returned to userspace as the return value of the write()
+ system call.
+Resource management
+ Since creation and destruction of all IB resources is done by
+ commands passed through a file descriptor, the kernel can keep track
+ of which resources are attached to a given userspace context. The
+ ib_uverbs module maintains idr tables that are used to translate
+ between kernel pointers and opaque userspace handles, so that kernel
+ pointers are never exposed to userspace and userspace cannot trick
+ the kernel into following a bogus pointer.
+ This also allows the kernel to clean up when a process exits and
+ prevent one process from touching another process's resources.
+Memory pinning
+ Direct userspace I/O requires that memory regions that are potential
+ I/O targets be kept resident at the same physical address. The
+ ib_uverbs module manages pinning and unpinning memory regions via
+ get_user_pages() and put_page() calls. It also accounts for the
+ amount of memory pinned in the process's locked_vm, and checks that
+ unprivileged processes do not exceed their RLIMIT_MEMLOCK limit.
+ Pages that are pinned multiple times are counted each time they are
+ pinned, so the value of locked_vm may be an overestimate of the
+ number of pages pinned by a process.
+/dev files
+ To create the appropriate character device files automatically with
+ udev, a rule like
+ KERNEL=="uverbs*", NAME="infiniband/%k"
+ can be used. This will create device nodes named
+ /dev/infiniband/uverbs0
+ and so on. Since the InfiniBand userspace verbs should be safe for
+ use by non-privileged processes, it may be useful to add an
+ appropriate MODE or GROUP to the udev rule.