aboutsummaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/stable/sysfs-bus-nvmem19
-rw-r--r--Documentation/ABI/testing/configfs-usb-gadget-rndis3
-rw-r--r--Documentation/ABI/testing/procfs-smaps_rollup31
-rw-r--r--Documentation/ABI/testing/sysfs-block-zram8
-rw-r--r--Documentation/ABI/testing/sysfs-bus-iio9
-rw-r--r--Documentation/ABI/testing/sysfs-bus-thunderbolt2
-rw-r--r--Documentation/ABI/testing/sysfs-bus-usb-lvstest13
-rw-r--r--Documentation/ABI/testing/sysfs-driver-altera-cvp8
-rw-r--r--Documentation/ABI/testing/sysfs-kernel-mm-swap26
-rw-r--r--Documentation/ABI/testing/sysfs-power12
-rw-r--r--Documentation/Makefile12
-rw-r--r--Documentation/RCU/Design/Requirements/Requirements.html130
-rw-r--r--Documentation/RCU/checklist.txt121
-rw-r--r--Documentation/RCU/rcu.txt9
-rw-r--r--Documentation/RCU/rcu_dereference.txt61
-rw-r--r--Documentation/RCU/rcubarrier.txt5
-rw-r--r--Documentation/RCU/torture.txt20
-rw-r--r--Documentation/RCU/whatisRCU.txt5
-rw-r--r--Documentation/admin-guide/devices.txt5
-rw-r--r--Documentation/admin-guide/kernel-parameters.rst1
-rw-r--r--Documentation/admin-guide/kernel-parameters.txt19
-rw-r--r--Documentation/admin-guide/pm/cpufreq.rst8
-rw-r--r--Documentation/admin-guide/pm/index.rst12
-rw-r--r--Documentation/admin-guide/pm/intel_pstate.rst61
-rw-r--r--Documentation/admin-guide/pm/sleep-states.rst245
-rw-r--r--Documentation/admin-guide/pm/strategies.rst52
-rw-r--r--Documentation/admin-guide/pm/system-wide.rst8
-rw-r--r--Documentation/admin-guide/pm/working-state.rst9
-rw-r--r--Documentation/arm/firmware.txt2
-rw-r--r--Documentation/arm64/cpu-feature-registers.txt2
-rw-r--r--Documentation/atomic_bitops.txt66
-rw-r--r--Documentation/atomic_t.txt242
-rw-r--r--Documentation/blockdev/zram.txt11
-rw-r--r--Documentation/cgroup-v2.txt221
-rw-r--r--Documentation/conf.py6
-rw-r--r--Documentation/core-api/genalloc.rst144
-rw-r--r--Documentation/core-api/index.rst1
-rw-r--r--Documentation/core-api/kernel-api.rst49
-rw-r--r--Documentation/core-api/workqueue.rst10
-rw-r--r--Documentation/dev-tools/gdb-kernel-debugging.rst6
-rw-r--r--Documentation/dev-tools/kgdb.rst11
-rw-r--r--Documentation/devicetree/bindings/arm/coresight.txt4
-rw-r--r--Documentation/devicetree/bindings/arm/pmu.txt2
-rw-r--r--Documentation/devicetree/bindings/ata/ahci-mtk.txt51
-rw-r--r--Documentation/devicetree/bindings/clock/mt8173-cpu-dvfs.txt83
-rw-r--r--Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt247
-rw-r--r--Documentation/devicetree/bindings/crypto/artpec6-crypto.txt16
-rw-r--r--Documentation/devicetree/bindings/crypto/atmel-crypto.txt13
-rw-r--r--Documentation/devicetree/bindings/crypto/st,stm32-hash.txt30
-rw-r--r--Documentation/devicetree/bindings/display/bridge/dw_mipi_dsi.txt32
-rw-r--r--Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt12
-rw-r--r--Documentation/devicetree/bindings/display/repaper.txt52
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/dw_hdmi-rockchip.txt7
-rw-r--r--Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt4
-rw-r--r--Documentation/devicetree/bindings/display/sitronix,st7586.txt22
-rw-r--r--Documentation/devicetree/bindings/display/st,stm32-ltdc.txt105
-rw-r--r--Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt36
-rw-r--r--Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.txt24
-rw-r--r--Documentation/devicetree/bindings/fpga/altera-passive-serial.txt29
-rw-r--r--Documentation/devicetree/bindings/fpga/xilinx-pr-decoupler.txt36
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-74x164.txt3
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-aspeed.txt2
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-davinci.txt91
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio-vf610.txt4
-rw-r--r--Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt16
-rw-r--r--Documentation/devicetree/bindings/hwmon/aspeed-pwm-tacho.txt9
-rw-r--r--Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt21
-rw-r--r--Documentation/devicetree/bindings/hwmon/ltq-cputemp.txt10
-rw-r--r--Documentation/devicetree/bindings/iio/adc/at91-sama5d2_adc.txt6
-rw-r--r--Documentation/devicetree/bindings/iio/adc/mt6577_auxadc.txt1
-rw-r--r--Documentation/devicetree/bindings/iio/adc/rockchip-saradc.txt1
-rw-r--r--Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt5
-rw-r--r--Documentation/devicetree/bindings/iio/dac/st,stm32-dac.txt4
-rw-r--r--Documentation/devicetree/bindings/iio/humidity/hdc100x.txt17
-rw-r--r--Documentation/devicetree/bindings/iio/humidity/hts221.txt11
-rw-r--r--Documentation/devicetree/bindings/iio/humidity/htu21.txt13
-rw-r--r--Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt8
-rw-r--r--Documentation/devicetree/bindings/iio/pressure/ms5637.txt17
-rw-r--r--Documentation/devicetree/bindings/iio/st-sensors.txt3
-rw-r--r--Documentation/devicetree/bindings/iio/temperature/tsys01.txt19
-rw-r--r--Documentation/devicetree/bindings/iio/timer/stm32-timer-trigger.txt6
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/fsl,ls-scfg-msi.txt8
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.txt32
-rw-r--r--Documentation/devicetree/bindings/mfd/wm831x.txt1
-rw-r--r--Documentation/devicetree/bindings/net/anarion-gmac.txt25
-rw-r--r--Documentation/devicetree/bindings/net/broadcom-bluetooth.txt35
-rw-r--r--Documentation/devicetree/bindings/net/dwmac-sun8i.txt84
-rw-r--r--Documentation/devicetree/bindings/net/marvell-pp2.txt29
-rw-r--r--Documentation/devicetree/bindings/net/mediatek-net.txt12
-rw-r--r--Documentation/devicetree/bindings/net/phy.txt5
-rw-r--r--Documentation/devicetree/bindings/net/renesas,ravb.txt30
-rw-r--r--Documentation/devicetree/bindings/net/rockchip-dwmac.txt1
-rw-r--r--Documentation/devicetree/bindings/net/xilinx_axienet.txt55
-rw-r--r--Documentation/devicetree/bindings/phy/phy-mtk-tphy.txt (renamed from Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt)17
-rw-r--r--Documentation/devicetree/bindings/phy/phy-mvebu-comphy.txt43
-rw-r--r--Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt11
-rw-r--r--Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt11
-rw-r--r--Documentation/devicetree/bindings/phy/ralink-usb-phy.txt23
-rw-r--r--Documentation/devicetree/bindings/phy/sun4i-usb-phy.txt10
-rw-r--r--Documentation/devicetree/bindings/pinctrl/cortina,gemini-pinctrl.txt59
-rw-r--r--Documentation/devicetree/bindings/pinctrl/fsl,imx7ulp-pinctrl.txt61
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl-aspeed.txt8
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt2
-rw-r--r--Documentation/devicetree/bindings/pinctrl/pinctrl-mt65xx.txt1
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,apq8064-pinctrl.txt3
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,ipq4019-pinctrl.txt6
-rw-r--r--Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt26
-rw-r--r--Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt1
-rw-r--r--Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt1
-rw-r--r--Documentation/devicetree/bindings/pinctrl/sprd,pinctrl.txt83
-rw-r--r--Documentation/devicetree/bindings/pinctrl/sprd,sc9860-pinctrl.txt70
-rw-r--r--Documentation/devicetree/bindings/power/rockchip-io-domain.txt2
-rw-r--r--Documentation/devicetree/bindings/regulator/mt6311-regulator.txt2
-rw-r--r--Documentation/devicetree/bindings/regulator/mt6323-regulator.txt2
-rw-r--r--Documentation/devicetree/bindings/regulator/mt6380-regulator.txt89
-rw-r--r--Documentation/devicetree/bindings/regulator/mt6397-regulator.txt2
-rw-r--r--Documentation/devicetree/bindings/regulator/pwm-regulator.txt2
-rw-r--r--Documentation/devicetree/bindings/regulator/st,stm32-vrefbuf.txt20
-rw-r--r--Documentation/devicetree/bindings/rng/imx-rngc.txt21
-rw-r--r--Documentation/devicetree/bindings/serial/8250.txt2
-rw-r--r--Documentation/devicetree/bindings/serial/renesas,sci-serial.txt2
-rw-r--r--Documentation/devicetree/bindings/serial/rs485.txt5
-rw-r--r--Documentation/devicetree/bindings/serial/st,stm32-usart.txt17
-rw-r--r--Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt1
-rw-r--r--Documentation/devicetree/bindings/spi/sh-msiof.txt1
-rw-r--r--Documentation/devicetree/bindings/spi/spi-rockchip.txt1
-rw-r--r--Documentation/devicetree/bindings/timer/nxp,tpm-timer.txt28
-rw-r--r--Documentation/devicetree/bindings/timer/renesas,cmt.txt73
-rw-r--r--Documentation/devicetree/bindings/usb/brcm,bdc.txt29
-rw-r--r--Documentation/devicetree/bindings/usb/fcs,fusb302.txt29
-rw-r--r--Documentation/devicetree/bindings/usb/keystone-usb.txt17
-rw-r--r--Documentation/devicetree/bindings/usb/mediatek,mtk-xhci.txt (renamed from Documentation/devicetree/bindings/usb/mt8173-xhci.txt)14
-rw-r--r--Documentation/devicetree/bindings/usb/mediatek,mtu3.txt (renamed from Documentation/devicetree/bindings/usb/mt8173-mtu3.txt)8
-rw-r--r--Documentation/devicetree/bindings/usb/renesas_usb3.txt16
-rw-r--r--Documentation/devicetree/bindings/vendor-prefixes.txt2
-rw-r--r--Documentation/devicetree/bindings/xilinx.txt2
-rw-r--r--Documentation/doc-guide/sphinx.rst107
-rw-r--r--Documentation/driver-api/basics.rst5
-rw-r--r--Documentation/driver-api/dma-buf.rst3
-rw-r--r--Documentation/driver-api/gpio.rst45
-rw-r--r--Documentation/driver-api/index.rst1
-rw-r--r--Documentation/driver-api/miscellaneous.rst1
-rw-r--r--Documentation/driver-api/s390-drivers.rst2
-rw-r--r--Documentation/driver-api/scsi.rst8
-rw-r--r--Documentation/driver-model/devres.txt1
-rw-r--r--Documentation/errseq.rst149
-rw-r--r--Documentation/features/core/tracehook/arch-support.txt2
-rw-r--r--Documentation/filesystems/caching/netfs-api.txt2
-rw-r--r--Documentation/filesystems/dax.txt5
-rw-r--r--Documentation/filesystems/vfs.txt4
-rw-r--r--Documentation/gpu/drm-internals.rst2
-rw-r--r--Documentation/gpu/drm-kms-helpers.rst9
-rw-r--r--Documentation/gpu/drm-kms.rst59
-rw-r--r--Documentation/gpu/drm-mm.rst4
-rw-r--r--Documentation/gpu/drm-uapi.rst2
-rw-r--r--Documentation/gpu/i915.rst18
-rw-r--r--Documentation/gpu/todo.rst4
-rw-r--r--Documentation/hwmon/ftsteutates4
-rw-r--r--Documentation/hwmon/ibm-cffps54
-rw-r--r--Documentation/hwmon/lm250669
-rw-r--r--Documentation/iio/ep93xx_adc.txt29
-rw-r--r--Documentation/infiniband/tag_matching.txt64
-rw-r--r--Documentation/input/input.rst2
-rw-r--r--Documentation/input/joydev/index.rst1
-rw-r--r--Documentation/kbuild/makefiles.txt6
-rw-r--r--Documentation/locking/crossrelease.txt874
-rw-r--r--Documentation/locking/rt-mutex-design.txt432
-rw-r--r--Documentation/locking/rt-mutex.txt58
-rw-r--r--Documentation/media/uapi/cec/cec-funcs.rst1
-rw-r--r--Documentation/memory-barriers.txt142
-rw-r--r--Documentation/networking/00-INDEX2
-rw-r--r--Documentation/networking/batman-adv.rst220
-rw-r--r--Documentation/networking/batman-adv.txt215
-rw-r--r--Documentation/networking/dpaa.txt68
-rw-r--r--Documentation/networking/filter.txt130
-rw-r--r--Documentation/networking/hinic.txt125
-rw-r--r--Documentation/networking/ieee802154.txt16
-rw-r--r--Documentation/networking/index.rst1
-rw-r--r--Documentation/networking/ip-sysctl.txt29
-rw-r--r--Documentation/networking/msg_zerocopy.rst257
-rw-r--r--Documentation/networking/netdev-FAQ.txt8
-rw-r--r--Documentation/networking/netvsc.txt75
-rw-r--r--Documentation/networking/nf_conntrack-sysctl.txt11
-rw-r--r--Documentation/networking/rmnet.txt82
-rw-r--r--Documentation/networking/rxrpc.txt57
-rw-r--r--Documentation/networking/strparser.txt209
-rw-r--r--Documentation/nvmem/nvmem.txt2
-rw-r--r--Documentation/power/states.txt125
-rw-r--r--Documentation/printk-formats.txt10
-rw-r--r--Documentation/process/applying-patches.rst43
-rw-r--r--Documentation/process/changes.rst16
-rw-r--r--Documentation/process/stable-kernel-rules.rst4
-rw-r--r--Documentation/process/submitting-patches.rst2
-rw-r--r--Documentation/security/keys/core.rst16
-rw-r--r--Documentation/security/keys/request-key.rst2
-rw-r--r--Documentation/security/keys/trusted-encrypted.rst2
-rw-r--r--Documentation/sphinx-static/theme_overrides.css17
-rw-r--r--Documentation/sphinx/kerneldoc.py8
-rw-r--r--Documentation/sphinx/requirements.txt3
-rw-r--r--Documentation/static-keys.txt20
-rw-r--r--Documentation/sysctl/kernel.txt13
-rw-r--r--Documentation/sysctl/net.txt2
-rw-r--r--Documentation/sysctl/vm.txt4
-rw-r--r--Documentation/trace/stm.txt13
-rw-r--r--Documentation/translations/ko_KR/memory-barriers.txt5
-rw-r--r--Documentation/translations/zh_CN/HOWTO2
-rw-r--r--Documentation/vm/numa7
-rw-r--r--Documentation/vm/swap_numa.txt69
-rw-r--r--Documentation/x86/early-microcode.txt70
-rw-r--r--Documentation/x86/intel_rdt_ui.txt323
-rw-r--r--Documentation/x86/microcode.txt137
-rw-r--r--Documentation/x86/orc-unwinder.txt179
212 files changed, 6868 insertions, 1737 deletions
diff --git a/Documentation/ABI/stable/sysfs-bus-nvmem b/Documentation/ABI/stable/sysfs-bus-nvmem
new file mode 100644
index 000000000000..5923ab4620c5
--- /dev/null
+++ b/Documentation/ABI/stable/sysfs-bus-nvmem
@@ -0,0 +1,19 @@
+What: /sys/bus/nvmem/devices/.../nvmem
+Date: July 2015
+KernelVersion: 4.2
+Contact: Srinivas Kandagatla <srinivas.kandagatla@linaro.org>
+Description:
+ This file allows user to read/write the raw NVMEM contents.
+ Permissions for write to this file depends on the nvmem
+ provider configuration.
+
+ ex:
+ hexdump /sys/bus/nvmem/devices/qfprom0/nvmem
+
+ 0000000 0000 0000 0000 0000 0000 0000 0000 0000
+ *
+ 00000a0 db10 2240 0000 e000 0c00 0c00 0000 0c00
+ 0000000 0000 0000 0000 0000 0000 0000 0000 0000
+ ...
+ *
+ 0001000
diff --git a/Documentation/ABI/testing/configfs-usb-gadget-rndis b/Documentation/ABI/testing/configfs-usb-gadget-rndis
index e32879b84b4d..137399095d74 100644
--- a/Documentation/ABI/testing/configfs-usb-gadget-rndis
+++ b/Documentation/ABI/testing/configfs-usb-gadget-rndis
@@ -12,3 +12,6 @@ Description:
Ethernet over USB link
dev_addr - MAC address of device's end of this
Ethernet over USB link
+ class - USB interface class, default is 02 (hex)
+ subclass - USB interface subclass, default is 06 (hex)
+ protocol - USB interface protocol, default is 00 (hex)
diff --git a/Documentation/ABI/testing/procfs-smaps_rollup b/Documentation/ABI/testing/procfs-smaps_rollup
new file mode 100644
index 000000000000..0a54ed0d63c9
--- /dev/null
+++ b/Documentation/ABI/testing/procfs-smaps_rollup
@@ -0,0 +1,31 @@
+What: /proc/pid/smaps_rollup
+Date: August 2017
+Contact: Daniel Colascione <dancol@google.com>
+Description:
+ This file provides pre-summed memory information for a
+ process. The format is identical to /proc/pid/smaps,
+ except instead of an entry for each VMA in a process,
+ smaps_rollup has a single entry (tagged "[rollup]")
+ for which each field is the sum of the corresponding
+ fields from all the maps in /proc/pid/smaps.
+ For more details, see the procfs man page.
+
+ Typical output looks like this:
+
+ 00100000-ff709000 ---p 00000000 00:00 0 [rollup]
+ Rss: 884 kB
+ Pss: 385 kB
+ Shared_Clean: 696 kB
+ Shared_Dirty: 0 kB
+ Private_Clean: 120 kB
+ Private_Dirty: 68 kB
+ Referenced: 884 kB
+ Anonymous: 68 kB
+ LazyFree: 0 kB
+ AnonHugePages: 0 kB
+ ShmemPmdMapped: 0 kB
+ Shared_Hugetlb: 0 kB
+ Private_Hugetlb: 0 kB
+ Swap: 0 kB
+ SwapPss: 0 kB
+ Locked: 385 kB
diff --git a/Documentation/ABI/testing/sysfs-block-zram b/Documentation/ABI/testing/sysfs-block-zram
index 451b6d882b2c..c1513c756af1 100644
--- a/Documentation/ABI/testing/sysfs-block-zram
+++ b/Documentation/ABI/testing/sysfs-block-zram
@@ -90,3 +90,11 @@ Description:
device's debugging info useful for kernel developers. Its
format is not documented intentionally and may change
anytime without any notice.
+
+What: /sys/block/zram<id>/backing_dev
+Date: June 2017
+Contact: Minchan Kim <minchan@kernel.org>
+Description:
+ The backing_dev file is read-write and set up backing
+ device for zram to write incompressible pages.
+ For using, user should enable CONFIG_ZRAM_WRITEBACK.
diff --git a/Documentation/ABI/testing/sysfs-bus-iio b/Documentation/ABI/testing/sysfs-bus-iio
index 2db2cdf42d54..7eead5f97e02 100644
--- a/Documentation/ABI/testing/sysfs-bus-iio
+++ b/Documentation/ABI/testing/sysfs-bus-iio
@@ -119,6 +119,15 @@ Description:
unique to allow association with event codes. Units after
application of scale and offset are milliamps.
+What: /sys/bus/iio/devices/iio:deviceX/in_powerY_raw
+KernelVersion: 4.5
+Contact: linux-iio@vger.kernel.org
+Description:
+ Raw (unscaled no bias removal etc.) power measurement from
+ channel Y. The number must always be specified and
+ unique to allow association with event codes. Units after
+ application of scale and offset are milliwatts.
+
What: /sys/bus/iio/devices/iio:deviceX/in_capacitanceY_raw
KernelVersion: 3.2
Contact: linux-iio@vger.kernel.org
diff --git a/Documentation/ABI/testing/sysfs-bus-thunderbolt b/Documentation/ABI/testing/sysfs-bus-thunderbolt
index 2a98149943ea..392bef5bd399 100644
--- a/Documentation/ABI/testing/sysfs-bus-thunderbolt
+++ b/Documentation/ABI/testing/sysfs-bus-thunderbolt
@@ -45,6 +45,8 @@ Contact: thunderbolt-software@lists.01.org
Description: When a devices supports Thunderbolt secure connect it will
have this attribute. Writing 32 byte hex string changes
authorization to use the secure connection method instead.
+ Writing an empty string clears the key and regular connection
+ method can be used again.
What: /sys/bus/thunderbolt/devices/.../device
Date: Sep 2017
diff --git a/Documentation/ABI/testing/sysfs-bus-usb-lvstest b/Documentation/ABI/testing/sysfs-bus-usb-lvstest
index 5151290cf8e7..ee0046dc4192 100644
--- a/Documentation/ABI/testing/sysfs-bus-usb-lvstest
+++ b/Documentation/ABI/testing/sysfs-bus-usb-lvstest
@@ -45,3 +45,16 @@ Contact: Pratyush Anand <pratyush.anand@gmail.com>
Description:
Write to this node to issue "U3 exit" for Link Layer
Validation device. It is needed for TD.7.36.
+
+What: /sys/bus/usb/devices/.../enable_compliance
+Date: July 2017
+Description:
+ Write to this node to set the port to compliance mode to test
+ with Link Layer Validation device. It is needed for TD.7.34.
+
+What: /sys/bus/usb/devices/.../warm_reset
+Date: July 2017
+Description:
+ Write to this node to issue "Warm Reset" for Link Layer Validation
+ device. It may be needed to properly reset an xHCI 1.1 host port if
+ compliance mode needed to be explicitly enabled.
diff --git a/Documentation/ABI/testing/sysfs-driver-altera-cvp b/Documentation/ABI/testing/sysfs-driver-altera-cvp
new file mode 100644
index 000000000000..8cde64a71edb
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-driver-altera-cvp
@@ -0,0 +1,8 @@
+What: /sys/bus/pci/drivers/altera-cvp/chkcfg
+Date: May 2017
+Kernel Version: 4.13
+Contact: Anatolij Gustschin <agust@denx.de>
+Description:
+ Contains either 1 or 0 and controls if configuration
+ error checking in altera-cvp driver is turned on or
+ off.
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-swap b/Documentation/ABI/testing/sysfs-kernel-mm-swap
new file mode 100644
index 000000000000..587db52084c7
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-swap
@@ -0,0 +1,26 @@
+What: /sys/kernel/mm/swap/
+Date: August 2017
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Interface for swapping
+
+What: /sys/kernel/mm/swap/vma_ra_enabled
+Date: August 2017
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: Enable/disable VMA based swap readahead.
+
+ If set to true, the VMA based swap readahead algorithm
+ will be used for swappable anonymous pages mapped in a
+ VMA, and the global swap readahead algorithm will be
+ still used for tmpfs etc. other users. If set to
+ false, the global swap readahead algorithm will be
+ used for all swappable pages.
+
+What: /sys/kernel/mm/swap/vma_ra_max_order
+Date: August 2017
+Contact: Linux memory management mailing list <linux-mm@kvack.org>
+Description: The max readahead size in order for VMA based swap readahead
+
+ VMA based swap readahead algorithm will readahead at
+ most 1 << max_order pages for each readahead. The
+ real readahead size for each readahead will be scaled
+ according to the estimation algorithm.
diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power
index f523e5a3ac33..713cab1d5f12 100644
--- a/Documentation/ABI/testing/sysfs-power
+++ b/Documentation/ABI/testing/sysfs-power
@@ -273,3 +273,15 @@ Description:
This output is useful for system wakeup diagnostics of spurious
wakeup interrupts.
+
+What: /sys/power/pm_debug_messages
+Date: July 2017
+Contact: Rafael J. Wysocki <rjw@rjwysocki.net>
+Description:
+ The /sys/power/pm_debug_messages file controls the printing
+ of debug messages from the system suspend/hiberbation
+ infrastructure to the kernel log.
+
+ Writing a "1" to this file enables the debug messages and
+ writing a "0" (default) to it disables them. Reads from
+ this file return the current value.
diff --git a/Documentation/Makefile b/Documentation/Makefile
index a42320385df3..85f7856f0092 100644
--- a/Documentation/Makefile
+++ b/Documentation/Makefile
@@ -22,6 +22,8 @@ ifeq ($(HAVE_SPHINX),0)
.DEFAULT:
$(warning The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed and in PATH, or set the SPHINXBUILD make variable to point to the full path of the '$(SPHINXBUILD)' executable.)
+ @echo
+ @./scripts/sphinx-pre-install
@echo " SKIP Sphinx $@ target."
else # HAVE_SPHINX
@@ -95,16 +97,6 @@ endif # HAVE_SPHINX
# The following targets are independent of HAVE_SPHINX, and the rules should
# work or silently pass without Sphinx.
-# no-ops for the Sphinx toolchain
-sgmldocs:
- @:
-psdocs:
- @:
-mandocs:
- @:
-installmandocs:
- @:
-
cleandocs:
$(Q)rm -rf $(BUILDDIR)
$(Q)$(MAKE) BUILDDIR=$(abspath $(BUILDDIR)) $(build)=Documentation/media clean
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index 95b30fa25d56..62e847bcdcdd 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2080,6 +2080,8 @@ Some of the relevant points of interest are as follows:
<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
<li> <a href="#Tracing and RCU">Tracing and RCU</a>.
<li> <a href="#Energy Efficiency">Energy Efficiency</a>.
+<li> <a href="#Scheduling-Clock Interrupts and RCU">
+ Scheduling-Clock Interrupts and RCU</a>.
<li> <a href="#Memory Efficiency">Memory Efficiency</a>.
<li> <a href="#Performance, Scalability, Response Time, and Reliability">
Performance, Scalability, Response Time, and Reliability</a>.
@@ -2532,6 +2534,134 @@ I learned of many of these requirements via angry phone calls:
Flaming me on the Linux-kernel mailing list was apparently not
sufficient to fully vent their ire at RCU's energy-efficiency bugs!
+<h3><a name="Scheduling-Clock Interrupts and RCU">
+Scheduling-Clock Interrupts and RCU</a></h3>
+
+<p>
+The kernel transitions between in-kernel non-idle execution, userspace
+execution, and the idle loop.
+Depending on kernel configuration, RCU handles these states differently:
+
+<table border=3>
+<tr><th><tt>HZ</tt> Kconfig</th>
+ <th>In-Kernel</th>
+ <th>Usermode</th>
+ <th>Idle</th></tr>
+<tr><th align="left"><tt>HZ_PERIODIC</tt></th>
+ <td>Can rely on scheduling-clock interrupt.</td>
+ <td>Can rely on scheduling-clock interrupt and its
+ detection of interrupt from usermode.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_IDLE</tt></th>
+ <td>Can rely on scheduling-clock interrupt.</td>
+ <td>Can rely on scheduling-clock interrupt and its
+ detection of interrupt from usermode.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+<tr><th align="left"><tt>NO_HZ_FULL</tt></th>
+ <td>Can only sometimes rely on scheduling-clock interrupt.
+ In other cases, it is necessary to bound kernel execution
+ times and/or use IPIs.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td>
+ <td>Can rely on RCU's dyntick-idle detection.</td></tr>
+</table>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ Why can't <tt>NO_HZ_FULL</tt> in-kernel execution rely on the
+ scheduling-clock interrupt, just like <tt>HZ_PERIODIC</tt>
+ and <tt>NO_HZ_IDLE</tt> do?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ Because, as a performance optimization, <tt>NO_HZ_FULL</tt>
+ does not necessarily re-enable the scheduling-clock interrupt
+ on entry to each and every system call.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+However, RCU must be reliably informed as to whether any given
+CPU is currently in the idle loop, and, for <tt>NO_HZ_FULL</tt>,
+also whether that CPU is executing in usermode, as discussed
+<a href="#Energy Efficiency">earlier</a>.
+It also requires that the scheduling-clock interrupt be enabled when
+RCU needs it to be:
+
+<ol>
+<li> If a CPU is either idle or executing in usermode, and RCU believes
+ it is non-idle, the scheduling-clock tick had better be running.
+ Otherwise, you will get RCU CPU stall warnings. Or at best,
+ very long (11-second) grace periods, with a pointless IPI waking
+ the CPU from time to time.
+<li> If a CPU is in a portion of the kernel that executes RCU read-side
+ critical sections, and RCU believes this CPU to be idle, you will get
+ random memory corruption. <b>DON'T DO THIS!!!</b>
+
+ <br>This is one reason to test with lockdep, which will complain
+ about this sort of thing.
+<li> If a CPU is in a portion of the kernel that is absolutely
+ positively no-joking guaranteed to never execute any RCU read-side
+ critical sections, and RCU believes this CPU to to be idle,
+ no problem. This sort of thing is used by some architectures
+ for light-weight exception handlers, which can then avoid the
+ overhead of <tt>rcu_irq_enter()</tt> and <tt>rcu_irq_exit()</tt>
+ at exception entry and exit, respectively.
+ Some go further and avoid the entireties of <tt>irq_enter()</tt>
+ and <tt>irq_exit()</tt>.
+
+ <br>Just make very sure you are running some of your tests with
+ <tt>CONFIG_PROVE_RCU=y</tt>, just in case one of your code paths
+ was in fact joking about not doing RCU read-side critical sections.
+<li> If a CPU is executing in the kernel with the scheduling-clock
+ interrupt disabled and RCU believes this CPU to be non-idle,
+ and if the CPU goes idle (from an RCU perspective) every few
+ jiffies, no problem. It is usually OK for there to be the
+ occasional gap between idle periods of up to a second or so.
+
+ <br>If the gap grows too long, you get RCU CPU stall warnings.
+<li> If a CPU is either idle or executing in usermode, and RCU believes
+ it to be idle, of course no problem.
+<li> If a CPU is executing in the kernel, the kernel code
+ path is passing through quiescent states at a reasonable
+ frequency (preferably about once per few jiffies, but the
+ occasional excursion to a second or so is usually OK) and the
+ scheduling-clock interrupt is enabled, of course no problem.
+
+ <br>If the gap between a successive pair of quiescent states grows
+ too long, you get RCU CPU stall warnings.
+</ol>
+
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+ But what if my driver has a hardware interrupt handler
+ that can run for many seconds?
+ I cannot invoke <tt>schedule()</tt> from an hardware
+ interrupt handler, after all!
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+ One approach is to do <tt>rcu_irq_exit();rcu_irq_enter();</tt>
+ every so often.
+ But given that long-running interrupt handlers can cause
+ other problems, not least for response time, shouldn't you
+ work to keep your interrupt handler's runtime within reasonable
+ bounds?
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+
+<p>
+But as long as RCU is properly informed of kernel state transitions between
+in-kernel execution, usermode execution, and idle, and as long as the
+scheduling-clock interrupt is enabled when RCU needs it to be, you
+can rest assured that the bugs you encounter will be in some other
+part of RCU or some other part of the kernel!
+
<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
<p>
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 6beda556faf3..49747717d905 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -23,6 +23,14 @@ over a rather long period of time, but improvements are always welcome!
Yet another exception is where the low real-time latency of RCU's
read-side primitives is critically important.
+ One final exception is where RCU readers are used to prevent
+ the ABA problem (https://en.wikipedia.org/wiki/ABA_problem)
+ for lockless updates. This does result in the mildly
+ counter-intuitive situation where rcu_read_lock() and
+ rcu_read_unlock() are used to protect updates, however, this
+ approach provides the same potential simplifications that garbage
+ collectors do.
+
1. Does the update code have proper mutual exclusion?
RCU does allow -readers- to run (almost) naked, but -writers- must
@@ -40,7 +48,9 @@ over a rather long period of time, but improvements are always welcome!
explain how this single task does not become a major bottleneck on
big multiprocessor machines (for example, if the task is updating
information relating to itself that other tasks can read, there
- by definition can be no bottleneck).
+ by definition can be no bottleneck). Note that the definition
+ of "large" has changed significantly: Eight CPUs was "large"
+ in the year 2000, but a hundred CPUs was unremarkable in 2017.
2. Do the RCU read-side critical sections make proper use of
rcu_read_lock() and friends? These primitives are needed
@@ -55,6 +65,12 @@ over a rather long period of time, but improvements are always welcome!
Disabling of preemption can serve as rcu_read_lock_sched(), but
is less readable.
+ Letting RCU-protected pointers "leak" out of an RCU read-side
+ critical section is every bid as bad as letting them leak out
+ from under a lock. Unless, of course, you have arranged some
+ other means of protection, such as a lock or a reference count
+ -before- letting them out of the RCU read-side critical section.
+
3. Does the update code tolerate concurrent accesses?
The whole point of RCU is to permit readers to run without
@@ -78,10 +94,10 @@ over a rather long period of time, but improvements are always welcome!
This works quite well, also.
- c. Make updates appear atomic to readers. For example,
+ c. Make updates appear atomic to readers. For example,
pointer updates to properly aligned fields will
appear atomic, as will individual atomic primitives.
- Sequences of perations performed under a lock will -not-
+ Sequences of operations performed under a lock will -not-
appear to be atomic to RCU readers, nor will sequences
of multiple atomic primitives.
@@ -168,8 +184,8 @@ over a rather long period of time, but improvements are always welcome!
5. If call_rcu(), or a related primitive such as call_rcu_bh(),
call_rcu_sched(), or call_srcu() is used, the callback function
- must be written to be called from softirq context. In particular,
- it cannot block.
+ will be called from softirq context. In particular, it cannot
+ block.
6. Since synchronize_rcu() can block, it cannot be called from
any sort of irq context. The same rule applies for
@@ -178,11 +194,14 @@ over a rather long period of time, but improvements are always welcome!
synchronize_sched_expedite(), and synchronize_srcu_expedited().
The expedited forms of these primitives have the same semantics
- as the non-expedited forms, but expediting is both expensive
- and unfriendly to real-time workloads. Use of the expedited
- primitives should be restricted to rare configuration-change
- operations that would not normally be undertaken while a real-time
- workload is running.
+ as the non-expedited forms, but expediting is both expensive and
+ (with the exception of synchronize_srcu_expedited()) unfriendly
+ to real-time workloads. Use of the expedited primitives should
+ be restricted to rare configuration-change operations that would
+ not normally be undertaken while a real-time workload is running.
+ However, real-time workloads can use rcupdate.rcu_normal kernel
+ boot parameter to completely disable expedited grace periods,
+ though this might have performance implications.
In particular, if you find yourself invoking one of the expedited
primitives repeatedly in a loop, please do everyone a favor:
@@ -193,11 +212,6 @@ over a rather long period of time, but improvements are always welcome!
of the system, especially to real-time workloads running on
the rest of the system.
- In addition, it is illegal to call the expedited forms from
- a CPU-hotplug notifier, or while holding a lock that is acquired
- by a CPU-hotplug notifier. Failing to observe this restriction
- will result in deadlock.
-
7. If the updater uses call_rcu() or synchronize_rcu(), then the
corresponding readers must use rcu_read_lock() and
rcu_read_unlock(). If the updater uses call_rcu_bh() or
@@ -321,7 +335,7 @@ over a rather long period of time, but improvements are always welcome!
Similarly, disabling preemption is not an acceptable substitute
for rcu_read_lock(). Code that attempts to use preemption
disabling where it should be using rcu_read_lock() will break
- in real-time kernel builds.
+ in CONFIG_PREEMPT=y kernel builds.
If you want to wait for interrupt handlers, NMI handlers, and
code under the influence of preempt_disable(), you instead
@@ -356,23 +370,22 @@ over a rather long period of time, but improvements are always welcome!
not the case, a self-spawning RCU callback would prevent the
victim CPU from ever going offline.)
-14. SRCU (srcu_read_lock(), srcu_read_unlock(), srcu_dereference(),
- synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu())
- may only be invoked from process context. Unlike other forms of
- RCU, it -is- permissible to block in an SRCU read-side critical
- section (demarked by srcu_read_lock() and srcu_read_unlock()),
- hence the "SRCU": "sleepable RCU". Please note that if you
- don't need to sleep in read-side critical sections, you should be
- using RCU rather than SRCU, because RCU is almost always faster
- and easier to use than is SRCU.
-
- Also unlike other forms of RCU, explicit initialization
- and cleanup is required via init_srcu_struct() and
- cleanup_srcu_struct(). These are passed a "struct srcu_struct"
- that defines the scope of a given SRCU domain. Once initialized,
- the srcu_struct is passed to srcu_read_lock(), srcu_read_unlock()
- synchronize_srcu(), synchronize_srcu_expedited(), and call_srcu().
- A given synchronize_srcu() waits only for SRCU read-side critical
+14. Unlike other forms of RCU, it -is- permissible to block in an
+ SRCU read-side critical section (demarked by srcu_read_lock()
+ and srcu_read_unlock()), hence the "SRCU": "sleepable RCU".
+ Please note that if you don't need to sleep in read-side critical
+ sections, you should be using RCU rather than SRCU, because RCU
+ is almost always faster and easier to use than is SRCU.
+
+ Also unlike other forms of RCU, explicit initialization and
+ cleanup is required either at build time via DEFINE_SRCU()
+ or DEFINE_STATIC_SRCU() or at runtime via init_srcu_struct()
+ and cleanup_srcu_struct(). These last two are passed a
+ "struct srcu_struct" that defines the scope of a given
+ SRCU domain. Once initialized, the srcu_struct is passed
+ to srcu_read_lock(), srcu_read_unlock() synchronize_srcu(),
+ synchronize_srcu_expedited(), and call_srcu(). A given
+ synchronize_srcu() waits only for SRCU read-side critical
sections governed by srcu_read_lock() and srcu_read_unlock()
calls that have been passed the same srcu_struct. This property
is what makes sleeping read-side critical sections tolerable --
@@ -390,10 +403,16 @@ over a rather long period of time, but improvements are always welcome!
Therefore, SRCU should be used in preference to rw_semaphore
only in extremely read-intensive situations, or in situations
requiring SRCU's read-side deadlock immunity or low read-side
- realtime latency.
+ realtime latency. You should also consider percpu_rw_semaphore
+ when you need lightweight readers.
- Note that, rcu_assign_pointer() relates to SRCU just as it does
- to other forms of RCU.
+ SRCU's expedited primitive (synchronize_srcu_expedited())
+ never sends IPIs to other CPUs, so it is easier on
+ real-time workloads than is synchronize_rcu_expedited(),
+ synchronize_rcu_bh_expedited() or synchronize_sched_expedited().
+
+ Note that rcu_dereference() and rcu_assign_pointer() relate to
+ SRCU just as they do to other forms of RCU.
15. The whole point of call_rcu(), synchronize_rcu(), and friends
is to wait until all pre-existing readers have finished before
@@ -435,3 +454,33 @@ over a rather long period of time, but improvements are always welcome!
These debugging aids can help you find problems that are
otherwise extremely difficult to spot.
+
+18. If you register a callback using call_rcu(), call_rcu_bh(),
+ call_rcu_sched(), or call_srcu(), and pass in a function defined
+ within a loadable module, then it in necessary to wait for
+ all pending callbacks to be invoked after the last invocation
+ and before unloading that module. Note that it is absolutely
+ -not- sufficient to wait for a grace period! The current (say)
+ synchronize_rcu() implementation waits only for all previous
+ callbacks registered on the CPU that synchronize_rcu() is running
+ on, but it is -not- guaranteed to wait for callbacks registered
+ on other CPUs.
+
+ You instead need to use one of the barrier functions:
+
+ o call_rcu() -> rcu_barrier()
+ o call_rcu_bh() -> rcu_barrier_bh()
+ o call_rcu_sched() -> rcu_barrier_sched()
+ o call_srcu() -> srcu_barrier()
+
+ However, these barrier functions are absolutely -not- guaranteed
+ to wait for a grace period. In fact, if there are no call_rcu()
+ callbacks waiting anywhere in the system, rcu_barrier() is within
+ its rights to return immediately.
+
+ So if you need to wait for both an RCU grace period and for
+ all pre-existing call_rcu() callbacks, you will need to execute
+ both rcu_barrier() and synchronize_rcu(), if necessary, using
+ something like workqueues to to execute them concurrently.
+
+ See rcubarrier.txt for more information.
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt
index 745f429fda79..7d4ae110c2c9 100644
--- a/Documentation/RCU/rcu.txt
+++ b/Documentation/RCU/rcu.txt
@@ -76,15 +76,12 @@ o I hear that RCU is patented? What is with that?
Of these, one was allowed to lapse by the assignee, and the
others have been contributed to the Linux kernel under GPL.
There are now also LGPL implementations of user-level RCU
- available (http://lttng.org/?q=node/18).
+ available (http://liburcu.org/).
o I hear that RCU needs work in order to support realtime kernels?
- This work is largely completed. Realtime-friendly RCU can be
- enabled via the CONFIG_PREEMPT_RCU kernel configuration
- parameter. However, work is in progress for enabling priority
- boosting of preempted RCU read-side critical sections. This is
- needed if you have CPU-bound realtime threads.
+ Realtime-friendly RCU can be enabled via the CONFIG_PREEMPT_RCU
+ kernel configuration parameter.
o Where can I find more information on RCU?
diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
index b2a613f16d74..1acb26b09b48 100644
--- a/Documentation/RCU/rcu_dereference.txt
+++ b/Documentation/RCU/rcu_dereference.txt
@@ -25,35 +25,35 @@ o You must use one of the rcu_dereference() family of primitives
for an example where the compiler can in fact deduce the exact
value of the pointer, and thus cause misordering.
+o You are only permitted to use rcu_dereference on pointer values.
+ The compiler simply knows too much about integral values to
+ trust it to carry dependencies through integer operations.
+ There are a very few exceptions, namely that you can temporarily
+ cast the pointer to uintptr_t in order to:
+
+ o Set bits and clear bits down in the must-be-zero low-order
+ bits of that pointer. This clearly means that the pointer
+ must have alignment constraints, for example, this does
+ -not- work in general for char* pointers.
+
+ o XOR bits to translate pointers, as is done in some
+ classic buddy-allocator algorithms.
+
+ It is important to cast the value back to pointer before
+ doing much of anything else with it.
+
o Avoid cancellation when using the "+" and "-" infix arithmetic
operators. For example, for a given variable "x", avoid
- "(x-x)". There are similar arithmetic pitfalls from other
- arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)".
- The compiler is within its rights to substitute zero for all of
- these expressions, so that subsequent accesses no longer depend
- on the rcu_dereference(), again possibly resulting in bugs due
- to misordering.
+ "(x-(uintptr_t)x)" for char* pointers. The compiler is within its
+ rights to substitute zero for this sort of expression, so that
+ subsequent accesses no longer depend on the rcu_dereference(),
+ again possibly resulting in bugs due to misordering.
Of course, if "p" is a pointer from rcu_dereference(), and "a"
and "b" are integers that happen to be equal, the expression
"p+a-b" is safe because its value still necessarily depends on
the rcu_dereference(), thus maintaining proper ordering.
-o Avoid all-zero operands to the bitwise "&" operator, and
- similarly avoid all-ones operands to the bitwise "|" operator.
- If the compiler is able to deduce the value of such operands,
- it is within its rights to substitute the corresponding constant
- for the bitwise operation. Once again, this causes subsequent
- accesses to no longer depend on the rcu_dereference(), causing
- bugs due to misordering.
-
- Please note that single-bit operands to bitwise "&" can also
- be dangerous. At this point, the compiler knows that the
- resulting value can only take on one of two possible values.
- Therefore, a very small amount of additional information will
- allow the compiler to deduce the exact value, which again can
- result in misordering.
-
o If you are using RCU to protect JITed functions, so that the
"()" function-invocation operator is applied to a value obtained
(directly or indirectly) from rcu_dereference(), you may need to
@@ -61,25 +61,6 @@ o If you are using RCU to protect JITed functions, so that the
This issue arises on some systems when a newly JITed function is
using the same memory that was used by an earlier JITed function.
-o Do not use the results from the boolean "&&" and "||" when
- dereferencing. For example, the following (rather improbable)
- code is buggy:
-
- int *p;
- int *q;
-
- ...
-
- p = rcu_dereference(gp)
- q = &global_q;
- q += p != &oom_p1 && p != &oom_p2;
- r1 = *q; /* BUGGY!!! */
-
- The reason this is buggy is that "&&" and "||" are often compiled
- using branches. While weak-memory machines such as ARM or PowerPC
- do order stores after such branches, they can speculate loads,
- which can result in misordering bugs.
-
o Do not use the results from relational operators ("==", "!=",
">", ">=", "<", or "<=") when dereferencing. For example,
the following (quite strange) code is buggy:
diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt
index b10cfe711e68..5d7759071a3e 100644
--- a/Documentation/RCU/rcubarrier.txt
+++ b/Documentation/RCU/rcubarrier.txt
@@ -263,6 +263,11 @@ Quick Quiz #2: What happens if CPU 0's rcu_barrier_func() executes
are delayed for a full grace period? Couldn't this result in
rcu_barrier() returning prematurely?
+The current rcu_barrier() implementation is more complex, due to the need
+to avoid disturbing idle CPUs (especially on battery-powered systems)
+and the need to minimally disturb non-idle CPUs in real-time systems.
+However, the code above illustrates the concepts.
+
rcu_barrier() Summary
diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt
index 278f6a9383b6..55918b54808b 100644
--- a/Documentation/RCU/torture.txt
+++ b/Documentation/RCU/torture.txt
@@ -276,15 +276,17 @@ o "Free-Block Circulation": Shows the number of torture structures
somehow gets incremented farther than it should.
Different implementations of RCU can provide implementation-specific
-additional information. For example, SRCU provides the following
+additional information. For example, Tree SRCU provides the following
additional line:
- srcu-torture: per-CPU(idx=1): 0(0,1) 1(0,1) 2(0,0) 3(0,1)
+ srcud-torture: Tree SRCU per-CPU(idx=0): 0(35,-21) 1(-4,24) 2(1,1) 3(-26,20) 4(28,-47) 5(-9,4) 6(-10,14) 7(-14,11) T(1,6)
-This line shows the per-CPU counter state. The numbers in parentheses are
-the values of the "old" and "current" counters for the corresponding CPU.
-The "idx" value maps the "old" and "current" values to the underlying
-array, and is useful for debugging.
+This line shows the per-CPU counter state, in this case for Tree SRCU
+using a dynamically allocated srcu_struct (hence "srcud-" rather than
+"srcu-"). The numbers in parentheses are the values of the "old" and
+"current" counters for the corresponding CPU. The "idx" value maps the
+"old" and "current" values to the underlying array, and is useful for
+debugging. The final "T" entry contains the totals of the counters.
USAGE
@@ -304,3 +306,9 @@ checked for such errors. The "rmmod" command forces a "SUCCESS",
"FAILURE", or "RCU_HOTPLUG" indication to be printk()ed. The first
two are self-explanatory, while the last indicates that while there
were no RCU failures, CPU-hotplug problems were detected.
+
+However, the tools/testing/selftests/rcutorture/bin/kvm.sh script
+provides better automation, including automatic failure analysis.
+It assumes a qemu/kvm-enabled platform, and runs guest OSes out of initrd.
+See tools/testing/selftests/rcutorture/doc/initrd.txt for instructions
+on setting up such an initrd.
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 8ed6c9f6133c..df62466da4e0 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -890,6 +890,8 @@ SRCU: Critical sections Grace period Barrier
srcu_read_lock_held
SRCU: Initialization/cleanup
+ DEFINE_SRCU
+ DEFINE_STATIC_SRCU
init_srcu_struct
cleanup_srcu_struct
@@ -913,7 +915,8 @@ a. Will readers need to block? If so, you need SRCU.
b. What about the -rt patchset? If readers would need to block
in an non-rt kernel, you need SRCU. If readers would block
in a -rt kernel, but not in a non-rt kernel, SRCU is not
- necessary.
+ necessary. (The -rt patchset turns spinlocks into sleeplocks,
+ hence this distinction.)
c. Do you need to treat NMI handlers, hardirq handlers,
and code segments with preemption disabled (whether
diff --git a/Documentation/admin-guide/devices.txt b/Documentation/admin-guide/devices.txt
index 6b71852dadc2..4ec843123cc3 100644
--- a/Documentation/admin-guide/devices.txt
+++ b/Documentation/admin-guide/devices.txt
@@ -3081,3 +3081,8 @@
1 = /dev/osd1 Second OSD Device
...
255 = /dev/osd255 256th OSD Device
+
+ 384-511 char RESERVED FOR DYNAMIC ASSIGNMENT
+ Character devices that request a dynamic allocation of major
+ number will take numbers starting from 511 and downward,
+ once the 234-254 range is full.
diff --git a/Documentation/admin-guide/kernel-parameters.rst b/Documentation/admin-guide/kernel-parameters.rst
index d76ab3907e2b..b2598cc9834c 100644
--- a/Documentation/admin-guide/kernel-parameters.rst
+++ b/Documentation/admin-guide/kernel-parameters.rst
@@ -138,6 +138,7 @@ parameter is applicable::
PPT Parallel port support is enabled.
PS2 Appropriate PS/2 support is enabled.
RAM RAM disk support is enabled.
+ RDT Intel Resource Director Technology.
S390 S390 architecture is enabled.
SCSI Appropriate SCSI support is enabled.
A lot of drivers have their options described inside
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 372cc66bba23..86b0e8ec8ad7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2644,9 +2644,10 @@
In kernels built with CONFIG_NO_HZ_FULL=y, set
the specified list of CPUs whose tick will be stopped
whenever possible. The boot CPU will be forced outside
- the range to maintain the timekeeping.
- The CPUs in this range must also be included in the
- rcu_nocbs= set.
+ the range to maintain the timekeeping. Any CPUs
+ in this list will have their RCU callbacks offloaded,
+ just as if they had also been called out in the
+ rcu_nocbs= boot parameter.
noiotrap [SH] Disables trapped I/O port accesses.
@@ -2782,7 +2783,7 @@
Allowed values are enable and disable
numa_zonelist_order= [KNL, BOOT] Select zonelist order for NUMA.
- one of ['zone', 'node', 'default'] can be specified
+ 'node', 'default' can be specified
This can be set from sysctl after boot.
See Documentation/sysctl/vm.txt for details.
@@ -3611,6 +3612,12 @@
Run specified binary instead of /init from the ramdisk,
used for early userspace startup. See initrd.
+ rdt= [HW,X86,RDT]
+ Turn on/off individual RDT features. List is:
+ cmt, mbmtotal, mbmlocal, l3cat, l3cdp, l2cat, mba.
+ E.g. to turn on cmt and turn off mba use:
+ rdt=cmt,!mba
+
reboot= [KNL]
Format (x86 or x86_64):
[w[arm] | c[old] | h[ard] | s[oft] | g[pio]] \
@@ -4388,6 +4395,10 @@
decrease the size and leave more room for directly
mapped kernel RAM.
+ vmcp_cma=nn[MG] [KNL,S390]
+ Sets the memory size reserved for contiguous memory
+ allocations for the vmcp device driver.
+
vmhalt= [KNL,S390] Perform z/VM CP command after system halt.
Format: <command>
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst
index 7af83a92d2d6..47153e64dfb5 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -479,14 +479,6 @@ This governor exposes the following tunables:
# echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate
-
-``min_sampling_rate``
- The minimum value of ``sampling_rate``.
-
- Equal to 10000 (10 ms) if :c:macro:`CONFIG_NO_HZ_COMMON` and
- :c:data:`tick_nohz_active` are both set or to 20 times the value of
- :c:data:`jiffies` in microseconds otherwise.
-
``up_threshold``
If the estimated CPU load is above this value (in percent), the governor
will set the frequency to the maximum value allowed for the policy.
diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst
index 7f148f76f432..49237ac73442 100644
--- a/Documentation/admin-guide/pm/index.rst
+++ b/Documentation/admin-guide/pm/index.rst
@@ -5,12 +5,6 @@ Power Management
.. toctree::
:maxdepth: 2
- cpufreq
- intel_pstate
-
-.. only:: subproject and html
-
- Indices
- =======
-
- * :ref:`genindex`
+ strategies
+ system-wide
+ working-state
diff --git a/Documentation/admin-guide/pm/intel_pstate.rst b/Documentation/admin-guide/pm/intel_pstate.rst
index 1d6249825efc..d2b6fda3d67b 100644
--- a/Documentation/admin-guide/pm/intel_pstate.rst
+++ b/Documentation/admin-guide/pm/intel_pstate.rst
@@ -167,35 +167,17 @@ is set.
``powersave``
.............
-Without HWP, this P-state selection algorithm generally depends on the
-processor model and/or the system profile setting in the ACPI tables and there
-are two variants of it.
-
-One of them is used with processors from the Atom line and (regardless of the
-processor model) on platforms with the system profile in the ACPI tables set to
-"mobile" (laptops mostly), "tablet", "appliance PC", "desktop", or
-"workstation". It is also used with processors supporting the HWP feature if
-that feature has not been enabled (that is, with the ``intel_pstate=no_hwp``
-argument in the kernel command line). It is similar to the algorithm
+Without HWP, this P-state selection algorithm is similar to the algorithm
implemented by the generic ``schedutil`` scaling governor except that the
utilization metric used by it is based on numbers coming from feedback
registers of the CPU. It generally selects P-states proportional to the
-current CPU utilization, so it is referred to as the "proportional" algorithm.
-
-The second variant of the ``powersave`` P-state selection algorithm, used in all
-of the other cases (generally, on processors from the Core line, so it is
-referred to as the "Core" algorithm), is based on the values read from the APERF
-and MPERF feedback registers and the previously requested target P-state.
-It does not really take CPU utilization into account explicitly, but as a rule
-it causes the CPU P-state to ramp up very quickly in response to increased
-utilization which is generally desirable in server environments.
-
-Regardless of the variant, this algorithm is run by the driver's utilization
-update callback for the given CPU when it is invoked by the CPU scheduler, but
-not more often than every 10 ms (that can be tweaked via ``debugfs`` in `this
-particular case <Tuning Interface in debugfs_>`_). Like in the ``performance``
-case, the hardware configuration is not touched if the new P-state turns out to
-be the same as the current one.
+current CPU utilization.
+
+This algorithm is run by the driver's utilization update callback for the
+given CPU when it is invoked by the CPU scheduler, but not more often than
+every 10 ms. Like in the ``performance`` case, the hardware configuration
+is not touched if the new P-state turns out to be the same as the current
+one.
This is the default P-state selection algorithm if the
:c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option
@@ -720,34 +702,7 @@ P-state is called, the ``ftrace`` filter can be set to to
gnome-shell-3409 [001] ..s. 2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
<idle>-0 [000] ..s. 2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
-Tuning Interface in ``debugfs``
--------------------------------
-
-The ``powersave`` algorithm provided by ``intel_pstate`` for `the Core line of
-processors in the active mode <powersave_>`_ is based on a `PID controller`_
-whose parameters were chosen to address a number of different use cases at the
-same time. However, it still is possible to fine-tune it to a specific workload
-and the ``debugfs`` interface under ``/sys/kernel/debug/pstate_snb/`` is
-provided for this purpose. [Note that the ``pstate_snb`` directory will be
-present only if the specific P-state selection algorithm matching the interface
-in it actually is in use.]
-
-The following files present in that directory can be used to modify the PID
-controller parameters at run time:
-
-| ``deadband``
-| ``d_gain_pct``
-| ``i_gain_pct``
-| ``p_gain_pct``
-| ``sample_rate_ms``
-| ``setpoint``
-
-Note, however, that achieving desirable results this way generally requires
-expert-level understanding of the power vs performance tradeoff, so extra care
-is recommended when attempting to do that.
-
.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf
-.. _PID controller: https://en.wikipedia.org/wiki/PID_controller
diff --git a/Documentation/admin-guide/pm/sleep-states.rst b/Documentation/admin-guide/pm/sleep-states.rst
new file mode 100644
index 000000000000..1e5c0f00cb2f
--- /dev/null
+++ b/Documentation/admin-guide/pm/sleep-states.rst
@@ -0,0 +1,245 @@
+===================
+System Sleep States
+===================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+Sleep states are global low-power states of the entire system in which user
+space code cannot be executed and the overall system activity is significantly
+reduced.
+
+
+Sleep States That Can Be Supported
+==================================
+
+Depending on its configuration and the capabilities of the platform it runs on,
+the Linux kernel can support up to four system sleep states, includig
+hibernation and up to three variants of system suspend. The sleep states that
+can be supported by the kernel are listed below.
+
+.. _s2idle:
+
+Suspend-to-Idle
+---------------
+
+This is a generic, pure software, light-weight variant of system suspend (also
+referred to as S2I or S2Idle). It allows more energy to be saved relative to
+runtime idle by freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states (possibly lower-power than available in the
+working state), such that the processors can spend time in their deepest idle
+states while the system is suspended.
+
+The system is woken up from this state by in-band interrupts, so theoretically
+any devices that can cause interrupts to be generated in the working state can
+also be set up as wakeup devices for S2Idle.
+
+This state can be used on platforms without support for :ref:`standby <standby>`
+or :ref:`suspend-to-RAM <s2ram>`, or it can be used in addition to any of the
+deeper system suspend variants to provide reduced resume latency. It is always
+supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option is set.
+
+.. _standby:
+
+Standby
+-------
+
+This state, if supported, offers moderate, but real, energy savings, while
+providing a relatively straightforward transition back to the working state. No
+operating state is lost (the system core logic retains power), so the system can
+go back to where it left off easily enough.
+
+In addition to freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states, which is done for :ref:`suspend-to-idle
+<s2idle>` too, nonboot CPUs are taken offline and all low-level system functions
+are suspended during transitions into this state. For this reason, it should
+allow more energy to be saved relative to :ref:`suspend-to-idle <s2idle>`, but
+the resume latency will generally be greater than for that state.
+
+The set of devices that can wake up the system from this state usually is
+reduced relative to :ref:`suspend-to-idle <s2idle>` and it may be necessary to
+rely on the platform for setting up the wakeup functionality as appropriate.
+
+This state is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration
+option is set and the support for it is registered by the platform with the
+core system suspend subsystem. On ACPI-based systems this state is mapped to
+the S1 system state defined by ACPI.
+
+.. _s2ram:
+
+Suspend-to-RAM
+--------------
+
+This state (also referred to as STR or S2RAM), if supported, offers significant
+energy savings as everything in the system is put into a low-power state, except
+for memory, which should be placed into the self-refresh mode to retain its
+contents. All of the steps carried out when entering :ref:`standby <standby>`
+are also carried out during transitions to S2RAM. Additional operations may
+take place depending on the platform capabilities. In particular, on ACPI-based
+systems the kernel passes control to the platform firmware (BIOS) as the last
+step during S2RAM transitions and that usually results in powering down some
+more low-level components that are not directly controlled by the kernel.
+
+The state of devices and CPUs is saved and held in memory. All devices are
+suspended and put into low-power states. In many cases, all peripheral buses
+lose power when entering S2RAM, so devices must be able to handle the transition
+back to the "on" state.
+
+On ACPI-based systems S2RAM requires some minimal boot-strapping code in the
+platform firmware to resume the system from it. This may be the case on other
+platforms too.
+
+The set of devices that can wake up the system from S2RAM usually is reduced
+relative to :ref:`suspend-to-idle <s2idle>` and :ref:`standby <standby>` and it
+may be necessary to rely on the platform for setting up the wakeup functionality
+as appropriate.
+
+S2RAM is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration option
+is set and the support for it is registered by the platform with the core system
+suspend subsystem. On ACPI-based systems it is mapped to the S3 system state
+defined by ACPI.
+
+.. _hibernation:
+
+Hibernation
+-----------
+
+This state (also referred to as Suspend-to-Disk or STD) offers the greatest
+energy savings and can be used even in the absence of low-level platform support
+for system suspend. However, it requires some low-level code for resuming the
+system to be present for the underlying CPU architecture.
+
+Hibernation is significantly different from any of the system suspend variants.
+It takes three system state changes to put it into hibernation and two system
+state changes to resume it.
+
+First, when hibernation is triggered, the kernel stops all system activity and
+creates a snapshot image of memory to be written into persistent storage. Next,
+the system goes into a state in which the snapshot image can be saved, the image
+is written out and finally the system goes into the target low-power state in
+which power is cut from almost all of its hardware components, including memory,
+except for a limited set of wakeup devices.
+
+Once the snapshot image has been written out, the system may either enter a
+special low-power state (like ACPI S4), or it may simply power down itself.
+Powering down means minimum power draw and it allows this mechanism to work on
+any system. However, entering a special low-power state may allow additional
+means of system wakeup to be used (e.g. pressing a key on the keyboard or
+opening a laptop lid).
+
+After wakeup, control goes to the platform firmware that runs a boot loader
+which boots a fresh instance of the kernel (control may also go directly to
+the boot loader, depending on the system configuration, but anyway it causes
+a fresh instance of the kernel to be booted). That new instance of the kernel
+(referred to as the ``restore kernel``) looks for a hibernation image in
+persistent storage and if one is found, it is loaded into memory. Next, all
+activity in the system is stopped and the restore kernel overwrites itself with
+the image contents and jumps into a special trampoline area in the original
+kernel stored in the image (referred to as the ``image kernel``), which is where
+the special architecture-specific low-level code is needed. Finally, the
+image kernel restores the system to the pre-hibernation state and allows user
+space to run again.
+
+Hibernation is supported if the :c:macro:`CONFIG_HIBERNATION` kernel
+configuration option is set. However, this option can only be set if support
+for the given CPU architecture includes the low-level code for system resume.
+
+
+Basic ``sysfs`` Interfaces for System Suspend and Hibernation
+=============================================================
+
+The following files located in the :file:`/sys/power/` directory can be used by
+user space for sleep states control.
+
+``state``
+ This file contains a list of strings representing sleep states supported
+ by the kernel. Writing one of these strings into it causes the kernel
+ to start a transition of the system into the sleep state represented by
+ that string.
+
+ In particular, the strings "disk", "freeze" and "standby" represent the
+ :ref:`hibernation <hibernation>`, :ref:`suspend-to-idle <s2idle>` and
+ :ref:`standby <standby>` sleep states, respectively. The string "mem"
+ is interpreted in accordance with the contents of the ``mem_sleep`` file
+ described below.
+
+ If the kernel does not support any system sleep states, this file is
+ not present.
+
+``mem_sleep``
+ This file contains a list of strings representing supported system
+ suspend variants and allows user space to select the variant to be
+ associated with the "mem" string in the ``state`` file described above.
+
+ The strings that may be present in this file are "s2idle", "shallow"
+ and "deep". The string "s2idle" always represents :ref:`suspend-to-idle
+ <s2idle>` and, by convention, "shallow" and "deep" represent
+ :ref:`standby <standby>` and :ref:`suspend-to-RAM <s2ram>`,
+ respectively.
+
+ Writing one of the listed strings into this file causes the system
+ suspend variant represented by it to be associated with the "mem" string
+ in the ``state`` file. The string representing the suspend variant
+ currently associated with the "mem" string in the ``state`` file
+ is listed in square brackets.
+
+ If the kernel does not support system suspend, this file is not present.
+
+``disk``
+ This file contains a list of strings representing different operations
+ that can be carried out after the hibernation image has been saved. The
+ possible options are as follows:
+
+ ``platform``
+ Put the system into a special low-power state (e.g. ACPI S4) to
+ make additional wakeup options available and possibly allow the
+ platform firmware to take a simplified initialization path after
+ wakeup.
+
+ ``shutdown``
+ Power off the system.
+
+ ``reboot``
+ Reboot the system (useful for diagnostics mostly).
+
+ ``suspend``
+ Hybrid system suspend. Put the system into the suspend sleep
+ state selected through the ``mem_sleep`` file described above.
+ If the system is successfully woken up from that state, discard
+ the hibernation image and continue. Otherwise, use the image
+ to restore the previous state of the system.
+
+ ``test_resume``
+ Diagnostic operation. Load the image as though the system had
+ just woken up from hibernation and the currently running kernel
+ instance was a restore kernel and follow up with full system
+ resume.
+
+ Writing one of the listed strings into this file causes the option
+ represented by it to be selected.
+
+ The currently selected option is shown in square brackets which means
+ that the operation represented by it will be carried out after creating
+ and saving the image next time hibernation is triggered by writing
+ ``disk`` to :file:`/sys/power/state`.
+
+ If the kernel does not support hibernation, this file is not present.
+
+According to the above, there are two ways to make the system go into the
+:ref:`suspend-to-idle <s2idle>` state. The first one is to write "freeze"
+directly to :file:`/sys/power/state`. The second one is to write "s2idle" to
+:file:`/sys/power/mem_sleep` and then to write "mem" to
+:file:`/sys/power/state`. Likewise, there are two ways to make the system go
+into the :ref:`standby <standby>` state (the strings to write to the control
+files in that case are "standby" or "shallow" and "mem", respectively) if that
+state is supported by the platform. However, there is only one way to make the
+system go into the :ref:`suspend-to-RAM <s2ram>` state (write "deep" into
+:file:`/sys/power/mem_sleep` and "mem" into :file:`/sys/power/state`).
+
+The default suspend variant (ie. the one to be used without writing anything
+into :file:`/sys/power/mem_sleep`) is either "deep" (on the majority of systems
+supporting :ref:`suspend-to-RAM <s2ram>`) or "s2idle", but it can be overridden
+by the value of the "mem_sleep_default" parameter in the kernel command line.
+On some ACPI-based systems, depending on the information in the ACPI tables, the
+default may be "s2idle" even if :ref:`suspend-to-RAM <s2ram>` is supported.
diff --git a/Documentation/admin-guide/pm/strategies.rst b/Documentation/admin-guide/pm/strategies.rst
new file mode 100644
index 000000000000..afe4d3f831fe
--- /dev/null
+++ b/Documentation/admin-guide/pm/strategies.rst
@@ -0,0 +1,52 @@
+===========================
+Power Management Strategies
+===========================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
+
+The Linux kernel supports two major high-level power management strategies.
+
+One of them is based on using global low-power states of the whole system in
+which user space code cannot be executed and the overall system activity is
+significantly reduced, referred to as :doc:`sleep states <sleep-states>`. The
+kernel puts the system into one of these states when requested by user space
+and the system stays in it until a special signal is received from one of
+designated devices, triggering a transition to the ``working state`` in which
+user space code can run. Because sleep states are global and the whole system
+is affected by the state changes, this strategy is referred to as the
+:doc:`system-wide power management <system-wide>`.
+
+The other strategy, referred to as the :doc:`working-state power management
+<working-state>`, is based on adjusting the power states of individual hardware
+components of the system, as needed, in the working state. In consequence, if
+this strategy is in use, the working state of the system usually does not
+correspond to any particular physical configuration of it, but can be treated as
+a metastate covering a range of different power states of the system in which
+the individual components of it can be either ``active`` (in use) or
+``inactive`` (idle). If they are active, they have to be in power states
+allowing them to process data and to be accessed by software. In turn, if they
+are inactive, ideally, they should be in low-power states in which they may not
+be accessible.
+
+If all of the system components are active, the system as a whole is regarded as
+"runtime active" and that situation typically corresponds to the maximum power
+draw (or maximum energy usage) of it. If all of them are inactive, the system
+as a whole is regarded as "runtime idle" which may be very close to a sleep
+state from the physical system configuration and power draw perspective, but
+then it takes much less time and effort to start executing user space code than
+for the same system in a sleep state. However, transitions from sleep states
+back to the working state can only be started by a limited set of devices, so
+typically the system can spend much more time in a sleep state than it can be
+runtime idle in one go. For this reason, systems usually use less energy in
+sleep states than when they are runtime idle most of the time.
+
+Moreover, the two power management strategies address different usage scenarios.
+Namely, if the user indicates that the system will not be in use going forward,
+for example by closing its lid (if the system is a laptop), it probably should
+go into a sleep state at that point. On the other hand, if the user simply goes
+away from the laptop keyboard, it probably should stay in the working state and
+use the working-state power management in case it becomes idle, because the user
+may come back to it at any time and then may want the system to be immediately
+accessible.
diff --git a/Documentation/admin-guide/pm/system-wide.rst b/Documentation/admin-guide/pm/system-wide.rst
new file mode 100644
index 000000000000..0c81e4c5de39
--- /dev/null
+++ b/Documentation/admin-guide/pm/system-wide.rst
@@ -0,0 +1,8 @@
+============================
+System-Wide Power Management
+============================
+
+.. toctree::
+ :maxdepth: 2
+
+ sleep-states
diff --git a/Documentation/admin-guide/pm/working-state.rst b/Documentation/admin-guide/pm/working-state.rst
new file mode 100644
index 000000000000..fa01bf083dfe
--- /dev/null
+++ b/Documentation/admin-guide/pm/working-state.rst
@@ -0,0 +1,9 @@
+==============================
+Working-State Power Management
+==============================
+
+.. toctree::
+ :maxdepth: 2
+
+ cpufreq
+ intel_pstate
diff --git a/Documentation/arm/firmware.txt b/Documentation/arm/firmware.txt
index da6713adac8a..7f175dbb427e 100644
--- a/Documentation/arm/firmware.txt
+++ b/Documentation/arm/firmware.txt
@@ -60,7 +60,7 @@ Example of using a firmware operation:
/* some platform code, e.g. SMP initialization */
- __raw_writel(virt_to_phys(exynos4_secondary_startup),
+ __raw_writel(__pa_symbol(exynos4_secondary_startup),
CPU1_BOOT_REG);
/* Call Exynos specific smc call */
diff --git a/Documentation/arm64/cpu-feature-registers.txt b/Documentation/arm64/cpu-feature-registers.txt
index d1c97f9f51cc..dad411d635d8 100644
--- a/Documentation/arm64/cpu-feature-registers.txt
+++ b/Documentation/arm64/cpu-feature-registers.txt
@@ -179,6 +179,8 @@ infrastructure:
| FCMA | [19-16] | y |
|--------------------------------------------------|
| JSCVT | [15-12] | y |
+ |--------------------------------------------------|
+ | DPB | [3-0] | y |
x--------------------------------------------------x
Appendix I: Example
diff --git a/Documentation/atomic_bitops.txt b/Documentation/atomic_bitops.txt
new file mode 100644
index 000000000000..5550bfdcce5f
--- /dev/null
+++ b/Documentation/atomic_bitops.txt
@@ -0,0 +1,66 @@
+
+On atomic bitops.
+
+
+While our bitmap_{}() functions are non-atomic, we have a number of operations
+operating on single bits in a bitmap that are atomic.
+
+
+API
+---
+
+The single bit operations are:
+
+Non-RMW ops:
+
+ test_bit()
+
+RMW atomic operations without return value:
+
+ {set,clear,change}_bit()
+ clear_bit_unlock()
+
+RMW atomic operations with return value:
+
+ test_and_{set,clear,change}_bit()
+ test_and_set_bit_lock()
+
+Barriers:
+
+ smp_mb__{before,after}_atomic()
+
+
+All RMW atomic operations have a '__' prefixed variant which is non-atomic.
+
+
+SEMANTICS
+---------
+
+Non-atomic ops:
+
+In particular __clear_bit_unlock() suffers the same issue as atomic_set(),
+which is why the generic version maps to clear_bit_unlock(), see atomic_t.txt.
+
+
+RMW ops:
+
+The test_and_{}_bit() operations return the original value of the bit.
+
+
+ORDERING
+--------
+
+Like with atomic_t, the rule of thumb is:
+
+ - non-RMW operations are unordered;
+
+ - RMW operations that have no return value are unordered;
+
+ - RMW operations that have a return value are fully ordered.
+
+Except for test_and_set_bit_lock() which has ACQUIRE semantics and
+clear_bit_unlock() which has RELEASE semantics.
+
+Since a platform only has a single means of achieving atomic operations
+the same barriers as for atomic_t are used, see atomic_t.txt.
+
diff --git a/Documentation/atomic_t.txt b/Documentation/atomic_t.txt
new file mode 100644
index 000000000000..913396ac5824
--- /dev/null
+++ b/Documentation/atomic_t.txt
@@ -0,0 +1,242 @@
+
+On atomic types (atomic_t atomic64_t and atomic_long_t).
+
+The atomic type provides an interface to the architecture's means of atomic
+RMW operations between CPUs (atomic operations on MMIO are not supported and
+can lead to fatal traps on some platforms).
+
+API
+---
+
+The 'full' API consists of (atomic64_ and atomic_long_ prefixes omitted for
+brevity):
+
+Non-RMW ops:
+
+ atomic_read(), atomic_set()
+ atomic_read_acquire(), atomic_set_release()
+
+
+RMW atomic operations:
+
+Arithmetic:
+
+ atomic_{add,sub,inc,dec}()
+ atomic_{add,sub,inc,dec}_return{,_relaxed,_acquire,_release}()
+ atomic_fetch_{add,sub,inc,dec}{,_relaxed,_acquire,_release}()
+
+
+Bitwise:
+
+ atomic_{and,or,xor,andnot}()
+ atomic_fetch_{and,or,xor,andnot}{,_relaxed,_acquire,_release}()
+
+
+Swap:
+
+ atomic_xchg{,_relaxed,_acquire,_release}()
+ atomic_cmpxchg{,_relaxed,_acquire,_release}()
+ atomic_try_cmpxchg{,_relaxed,_acquire,_release}()
+
+
+Reference count (but please see refcount_t):
+
+ atomic_add_unless(), atomic_inc_not_zero()
+ atomic_sub_and_test(), atomic_dec_and_test()
+
+
+Misc:
+
+ atomic_inc_and_test(), atomic_add_negative()
+ atomic_dec_unless_positive(), atomic_inc_unless_negative()
+
+
+Barriers:
+
+ smp_mb__{before,after}_atomic()
+
+
+
+SEMANTICS
+---------
+
+Non-RMW ops:
+
+The non-RMW ops are (typically) regular LOADs and STOREs and are canonically
+implemented using READ_ONCE(), WRITE_ONCE(), smp_load_acquire() and
+smp_store_release() respectively.
+
+The one detail to this is that atomic_set{}() should be observable to the RMW
+ops. That is:
+
+ C atomic-set
+
+ {
+ atomic_set(v, 1);
+ }
+
+ P1(atomic_t *v)
+ {
+ atomic_add_unless(v, 1, 0);
+ }
+
+ P2(atomic_t *v)
+ {
+ atomic_set(v, 0);
+ }
+
+ exists
+ (v=2)
+
+In this case we would expect the atomic_set() from CPU1 to either happen
+before the atomic_add_unless(), in which case that latter one would no-op, or
+_after_ in which case we'd overwrite its result. In no case is "2" a valid
+outcome.
+
+This is typically true on 'normal' platforms, where a regular competing STORE
+will invalidate a LL/SC or fail a CMPXCHG.
+
+The obvious case where this is not so is when we need to implement atomic ops
+with a lock:
+
+ CPU0 CPU1
+
+ atomic_add_unless(v, 1, 0);
+ lock();
+ ret = READ_ONCE(v->counter); // == 1
+ atomic_set(v, 0);
+ if (ret != u) WRITE_ONCE(v->counter, 0);
+ WRITE_ONCE(v->counter, ret + 1);
+ unlock();
+
+the typical solution is to then implement atomic_set{}() with atomic_xchg().
+
+
+RMW ops:
+
+These come in various forms:
+
+ - plain operations without return value: atomic_{}()
+
+ - operations which return the modified value: atomic_{}_return()
+
+ these are limited to the arithmetic operations because those are
+ reversible. Bitops are irreversible and therefore the modified value
+ is of dubious utility.
+
+ - operations which return the original value: atomic_fetch_{}()
+
+ - swap operations: xchg(), cmpxchg() and try_cmpxchg()
+
+ - misc; the special purpose operations that are commonly used and would,
+ given the interface, normally be implemented using (try_)cmpxchg loops but
+ are time critical and can, (typically) on LL/SC architectures, be more
+ efficiently implemented.
+
+All these operations are SMP atomic; that is, the operations (for a single
+atomic variable) can be fully ordered and no intermediate state is lost or
+visible.
+
+
+ORDERING (go read memory-barriers.txt first)
+--------
+
+The rule of thumb:
+
+ - non-RMW operations are unordered;
+
+ - RMW operations that have no return value are unordered;
+
+ - RMW operations that have a return value are fully ordered;
+
+ - RMW operations that are conditional are unordered on FAILURE,
+ otherwise the above rules apply.
+
+Except of course when an operation has an explicit ordering like:
+
+ {}_relaxed: unordered
+ {}_acquire: the R of the RMW (or atomic_read) is an ACQUIRE
+ {}_release: the W of the RMW (or atomic_set) is a RELEASE
+
+Where 'unordered' is against other memory locations. Address dependencies are
+not defeated.
+
+Fully ordered primitives are ordered against everything prior and everything
+subsequent. Therefore a fully ordered primitive is like having an smp_mb()
+before and an smp_mb() after the primitive.
+
+
+The barriers:
+
+ smp_mb__{before,after}_atomic()
+
+only apply to the RMW ops and can be used to augment/upgrade the ordering
+inherent to the used atomic op. These barriers provide a full smp_mb().
+
+These helper barriers exist because architectures have varying implicit
+ordering on their SMP atomic primitives. For example our TSO architectures
+provide full ordered atomics and these barriers are no-ops.
+
+Thus:
+
+ atomic_fetch_add();
+
+is equivalent to:
+
+ smp_mb__before_atomic();
+ atomic_fetch_add_relaxed();
+ smp_mb__after_atomic();
+
+However the atomic_fetch_add() might be implemented more efficiently.
+
+Further, while something like:
+
+ smp_mb__before_atomic();
+ atomic_dec(&X);
+
+is a 'typical' RELEASE pattern, the barrier is strictly stronger than
+a RELEASE. Similarly for something like:
+
+ atomic_inc(&X);
+ smp_mb__after_atomic();
+
+is an ACQUIRE pattern (though very much not typical), but again the barrier is
+strictly stronger than ACQUIRE. As illustrated:
+
+ C strong-acquire
+
+ {
+ }
+
+ P1(int *x, atomic_t *y)
+ {
+ r0 = READ_ONCE(*x);
+ smp_rmb();
+ r1 = atomic_read(y);
+ }
+
+ P2(int *x, atomic_t *y)
+ {
+ atomic_inc(y);
+ smp_mb__after_atomic();
+ WRITE_ONCE(*x, 1);
+ }
+
+ exists
+ (r0=1 /\ r1=0)
+
+This should not happen; but a hypothetical atomic_inc_acquire() --
+(void)atomic_fetch_inc_acquire() for instance -- would allow the outcome,
+since then:
+
+ P1 P2
+
+ t = LL.acq *y (0)
+ t++;
+ *x = 1;
+ r0 = *x (1)
+ RMB
+ r1 = *y (0)
+ SC *y, t;
+
+is allowed.
diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 4fced8a21307..257e65714c6a 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -168,6 +168,7 @@ max_comp_streams RW the number of possible concurrent compress operations
comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes
+backing_dev RW set up backend storage for zram to write out
User space is advised to use the following files to read the device statistics.
@@ -231,5 +232,15 @@ line of text and contains the following stats separated by whitespace:
resets the disksize to zero. You must set the disksize again
before reusing the device.
+* Optional Feature
+
+= writeback
+
+With incompressible pages, there is no memory saving with zram.
+Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
+to backing storage rather than keeping it in memory.
+User should set up backing device via /sys/block/zramX/backing_dev
+before disksize setting.
+
Nitin Gupta
ngupta@vflare.org
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index bde177103567..dc44785dc0fa 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -18,7 +18,9 @@ v1 is available under Documentation/cgroup-v1/.
1-2. What is cgroup?
2. Basic Operations
2-1. Mounting
- 2-2. Organizing Processes
+ 2-2. Organizing Processes and Threads
+ 2-2-1. Processes
+ 2-2-2. Threads
2-3. [Un]populated Notification
2-4. Controlling Controllers
2-4-1. Enabling and Disabling
@@ -167,8 +169,11 @@ cgroup v2 currently supports the following mount options.
Delegation section for details.
-Organizing Processes
---------------------
+Organizing Processes and Threads
+--------------------------------
+
+Processes
+~~~~~~~~~
Initially, only the root cgroup exists to which all processes belong.
A child cgroup can be created by creating a sub-directory::
@@ -219,6 +224,105 @@ is removed subsequently, " (deleted)" is appended to the path::
0::/test-cgroup/test-cgroup-nested (deleted)
+Threads
+~~~~~~~
+
+cgroup v2 supports thread granularity for a subset of controllers to
+support use cases requiring hierarchical resource distribution across
+the threads of a group of processes. By default, all threads of a
+process belong to the same cgroup, which also serves as the resource
+domain to host resource consumptions which are not specific to a
+process or thread. The thread mode allows threads to be spread across
+a subtree while still maintaining the common resource domain for them.
+
+Controllers which support thread mode are called threaded controllers.
+The ones which don't are called domain controllers.
+
+Marking a cgroup threaded makes it join the resource domain of its
+parent as a threaded cgroup. The parent may be another threaded
+cgroup whose resource domain is further up in the hierarchy. The root
+of a threaded subtree, that is, the nearest ancestor which is not
+threaded, is called threaded domain or thread root interchangeably and
+serves as the resource domain for the entire subtree.
+
+Inside a threaded subtree, threads of a process can be put in
+different cgroups and are not subject to the no internal process
+constraint - threaded controllers can be enabled on non-leaf cgroups
+whether they have threads in them or not.
+
+As the threaded domain cgroup hosts all the domain resource
+consumptions of the subtree, it is considered to have internal
+resource consumptions whether there are processes in it or not and
+can't have populated child cgroups which aren't threaded. Because the
+root cgroup is not subject to no internal process constraint, it can
+serve both as a threaded domain and a parent to domain cgroups.
+
+The current operation mode or type of the cgroup is shown in the
+"cgroup.type" file which indicates whether the cgroup is a normal
+domain, a domain which is serving as the domain of a threaded subtree,
+or a threaded cgroup.
+
+On creation, a cgroup is always a domain cgroup and can be made
+threaded by writing "threaded" to the "cgroup.type" file. The
+operation is single direction::
+
+ # echo threaded > cgroup.type
+
+Once threaded, the cgroup can't be made a domain again. To enable the
+thread mode, the following conditions must be met.
+
+- As the cgroup will join the parent's resource domain. The parent
+ must either be a valid (threaded) domain or a threaded cgroup.
+
+- When the parent is an unthreaded domain, it must not have any domain
+ controllers enabled or populated domain children. The root is
+ exempt from this requirement.
+
+Topology-wise, a cgroup can be in an invalid state. Please consider
+the following toplogy::
+
+ A (threaded domain) - B (threaded) - C (domain, just created)
+
+C is created as a domain but isn't connected to a parent which can
+host child domains. C can't be used until it is turned into a
+threaded cgroup. "cgroup.type" file will report "domain (invalid)" in
+these cases. Operations which fail due to invalid topology use
+EOPNOTSUPP as the errno.
+
+A domain cgroup is turned into a threaded domain when one of its child
+cgroup becomes threaded or threaded controllers are enabled in the
+"cgroup.subtree_control" file while there are processes in the cgroup.
+A threaded domain reverts to a normal domain when the conditions
+clear.
+
+When read, "cgroup.threads" contains the list of the thread IDs of all
+threads in the cgroup. Except that the operations are per-thread
+instead of per-process, "cgroup.threads" has the same format and
+behaves the same way as "cgroup.procs". While "cgroup.threads" can be
+written to in any cgroup, as it can only move threads inside the same
+threaded domain, its operations are confined inside each threaded
+subtree.
+
+The threaded domain cgroup serves as the resource domain for the whole
+subtree, and, while the threads can be scattered across the subtree,
+all the processes are considered to be in the threaded domain cgroup.
+"cgroup.procs" in a threaded domain cgroup contains the PIDs of all
+processes in the subtree and is not readable in the subtree proper.
+However, "cgroup.procs" can be written to from anywhere in the subtree
+to migrate all threads of the matching process to the cgroup.
+
+Only threaded controllers can be enabled in a threaded subtree. When
+a threaded controller is enabled inside a threaded subtree, it only
+accounts for and controls resource consumptions associated with the
+threads in the cgroup and its descendants. All consumptions which
+aren't tied to a specific thread belong to the threaded domain cgroup.
+
+Because a threaded subtree is exempt from no internal process
+constraint, a threaded controller must be able to handle competition
+between threads in a non-leaf cgroup and its child cgroups. Each
+threaded controller defines how such competitions are handled.
+
+
[Un]populated Notification
--------------------------
@@ -302,15 +406,15 @@ disabled if one or more children have it enabled.
No Internal Process Constraint
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-Non-root cgroups can only distribute resources to their children when
-they don't have any processes of their own. In other words, only
-cgroups which don't contain any processes can have controllers enabled
-in their "cgroup.subtree_control" files.
+Non-root cgroups can distribute domain resources to their children
+only when they don't have any processes of their own. In other words,
+only domain cgroups which don't contain any processes can have domain
+controllers enabled in their "cgroup.subtree_control" files.
-This guarantees that, when a controller is looking at the part of the
-hierarchy which has it enabled, processes are always only on the
-leaves. This rules out situations where child cgroups compete against
-internal processes of the parent.
+This guarantees that, when a domain controller is looking at the part
+of the hierarchy which has it enabled, processes are always only on
+the leaves. This rules out situations where child cgroups compete
+against internal processes of the parent.
The root cgroup is exempt from this restriction. Root contains
processes and anonymous resource consumption which can't be associated
@@ -334,10 +438,10 @@ Model of Delegation
~~~~~~~~~~~~~~~~~~~
A cgroup can be delegated in two ways. First, to a less privileged
-user by granting write access of the directory and its "cgroup.procs"
-and "cgroup.subtree_control" files to the user. Second, if the
-"nsdelegate" mount option is set, automatically to a cgroup namespace
-on namespace creation.
+user by granting write access of the directory and its "cgroup.procs",
+"cgroup.threads" and "cgroup.subtree_control" files to the user.
+Second, if the "nsdelegate" mount option is set, automatically to a
+cgroup namespace on namespace creation.
Because the resource control interface files in a given directory
control the distribution of the parent's resources, the delegatee
@@ -644,6 +748,29 @@ Core Interface Files
All cgroup core files are prefixed with "cgroup."
+ cgroup.type
+
+ A read-write single value file which exists on non-root
+ cgroups.
+
+ When read, it indicates the current type of the cgroup, which
+ can be one of the following values.
+
+ - "domain" : A normal valid domain cgroup.
+
+ - "domain threaded" : A threaded domain cgroup which is
+ serving as the root of a threaded subtree.
+
+ - "domain invalid" : A cgroup which is in an invalid state.
+ It can't be populated or have controllers enabled. It may
+ be allowed to become a threaded cgroup.
+
+ - "threaded" : A threaded cgroup which is a member of a
+ threaded subtree.
+
+ A cgroup can be turned into a threaded cgroup by writing
+ "threaded" to this file.
+
cgroup.procs
A read-write new-line separated values file which exists on
all cgroups.
@@ -658,9 +785,6 @@ All cgroup core files are prefixed with "cgroup."
the PID to the cgroup. The writer should match all of the
following conditions.
- - Its euid is either root or must match either uid or suid of
- the target process.
-
- It must have write access to the "cgroup.procs" file.
- It must have write access to the "cgroup.procs" file of the
@@ -669,6 +793,35 @@ All cgroup core files are prefixed with "cgroup."
When delegating a sub-hierarchy, write access to this file
should be granted along with the containing directory.
+ In a threaded cgroup, reading this file fails with EOPNOTSUPP
+ as all the processes belong to the thread root. Writing is
+ supported and moves every thread of the process to the cgroup.
+
+ cgroup.threads
+ A read-write new-line separated values file which exists on
+ all cgroups.
+
+ When read, it lists the TIDs of all threads which belong to
+ the cgroup one-per-line. The TIDs are not ordered and the
+ same TID may show up more than once if the thread got moved to
+ another cgroup and then back or the TID got recycled while
+ reading.
+
+ A TID can be written to migrate the thread associated with the
+ TID to the cgroup. The writer should match all of the
+ following conditions.
+
+ - It must have write access to the "cgroup.threads" file.
+
+ - The cgroup that the thread is currently in must be in the
+ same resource domain as the destination cgroup.
+
+ - It must have write access to the "cgroup.procs" file of the
+ common ancestor of the source and destination cgroups.
+
+ When delegating a sub-hierarchy, write access to this file
+ should be granted along with the containing directory.
+
cgroup.controllers
A read-only space separated values file which exists on all
cgroups.
@@ -701,6 +854,38 @@ All cgroup core files are prefixed with "cgroup."
1 if the cgroup or its descendants contains any live
processes; otherwise, 0.
+ cgroup.max.descendants
+ A read-write single value files. The default is "max".
+
+ Maximum allowed number of descent cgroups.
+ If the actual number of descendants is equal or larger,
+ an attempt to create a new cgroup in the hierarchy will fail.
+
+ cgroup.max.depth
+ A read-write single value files. The default is "max".
+
+ Maximum allowed descent depth below the current cgroup.
+ If the actual descent depth is equal or larger,
+ an attempt to create a new child cgroup will fail.
+
+ cgroup.stat
+ A read-only flat-keyed file with the following entries:
+
+ nr_descendants
+ Total number of visible descendant cgroups.
+
+ nr_dying_descendants
+ Total number of dying descendant cgroups. A cgroup becomes
+ dying after being deleted by a user. The cgroup will remain
+ in dying state for some time undefined time (which can depend
+ on system load) before being completely destroyed.
+
+ A process can't enter a dying cgroup under any circumstances,
+ a dying cgroup can't revive.
+
+ A dying cgroup can consume system resources not exceeding
+ limits, which were active at the moment of cgroup deletion.
+
Controllers
===========
diff --git a/Documentation/conf.py b/Documentation/conf.py
index 71b032bb44fd..f9054ab60cb1 100644
--- a/Documentation/conf.py
+++ b/Documentation/conf.py
@@ -29,7 +29,7 @@ from load_config import loadConfig
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
-needs_sphinx = '1.2'
+needs_sphinx = '1.3'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
@@ -344,8 +344,8 @@ if major == 1 and minor > 3:
if major == 1 and minor <= 4:
latex_elements['preamble'] += '\\usepackage[margin=0.5in, top=1in, bottom=1in]{geometry}'
elif major == 1 and (minor > 5 or (minor == 5 and patch >= 3)):
- latex_elements['sphinxsetup'] = 'hmargin=0.5in, vmargin=0.5in'
-
+ latex_elements['sphinxsetup'] = 'hmargin=0.5in, vmargin=1in'
+ latex_elements['preamble'] += '\\fvset{fontsize=auto}\n'
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
diff --git a/Documentation/core-api/genalloc.rst b/Documentation/core-api/genalloc.rst
new file mode 100644
index 000000000000..6b38a39fab24
--- /dev/null
+++ b/Documentation/core-api/genalloc.rst
@@ -0,0 +1,144 @@
+The genalloc/genpool subsystem
+==============================
+
+There are a number of memory-allocation subsystems in the kernel, each
+aimed at a specific need. Sometimes, however, a kernel developer needs to
+implement a new allocator for a specific range of special-purpose memory;
+often that memory is located on a device somewhere. The author of the
+driver for that device can certainly write a little allocator to get the
+job done, but that is the way to fill the kernel with dozens of poorly
+tested allocators. Back in 2005, Jes Sorensen lifted one of those
+allocators from the sym53c8xx_2 driver and posted_ it as a generic module
+for the creation of ad hoc memory allocators. This code was merged
+for the 2.6.13 release; it has been modified considerably since then.
+
+.. _posted: https://lwn.net/Articles/125842/
+
+Code using this allocator should include <linux/genalloc.h>. The action
+begins with the creation of a pool using one of:
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_create
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: devm_gen_pool_create
+
+A call to :c:func:`gen_pool_create` will create a pool. The granularity of
+allocations is set with min_alloc_order; it is a log-base-2 number like
+those used by the page allocator, but it refers to bytes rather than pages.
+So, if min_alloc_order is passed as 3, then all allocations will be a
+multiple of eight bytes. Increasing min_alloc_order decreases the memory
+required to track the memory in the pool. The nid parameter specifies
+which NUMA node should be used for the allocation of the housekeeping
+structures; it can be -1 if the caller doesn't care.
+
+The "managed" interface :c:func:`devm_gen_pool_create` ties the pool to a
+specific device. Among other things, it will automatically clean up the
+pool when the given device is destroyed.
+
+A pool is shut down with:
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_destroy
+
+It's worth noting that, if there are still allocations outstanding from the
+given pool, this function will take the rather extreme step of invoking
+BUG(), crashing the entire system. You have been warned.
+
+A freshly created pool has no memory to allocate. It is fairly useless in
+that state, so one of the first orders of business is usually to add memory
+to the pool. That can be done with one of:
+
+.. kernel-doc:: include/linux/genalloc.h
+ :functions: gen_pool_add
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_add_virt
+
+A call to :c:func:`gen_pool_add` will place the size bytes of memory
+starting at addr (in the kernel's virtual address space) into the given
+pool, once again using nid as the node ID for ancillary memory allocations.
+The :c:func:`gen_pool_add_virt` variant associates an explicit physical
+address with the memory; this is only necessary if the pool will be used
+for DMA allocations.
+
+The functions for allocating memory from the pool (and putting it back)
+are:
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_alloc
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_dma_alloc
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_free
+
+As one would expect, :c:func:`gen_pool_alloc` will allocate size< bytes
+from the given pool. The :c:func:`gen_pool_dma_alloc` variant allocates
+memory for use with DMA operations, returning the associated physical
+address in the space pointed to by dma. This will only work if the memory
+was added with :c:func:`gen_pool_add_virt`. Note that this function
+departs from the usual genpool pattern of using unsigned long values to
+represent kernel addresses; it returns a void * instead.
+
+That all seems relatively simple; indeed, some developers clearly found it
+to be too simple. After all, the interface above provides no control over
+how the allocation functions choose which specific piece of memory to
+return. If that sort of control is needed, the following functions will be
+of interest:
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_alloc_algo
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_set_algo
+
+Allocations with :c:func:`gen_pool_alloc_algo` specify an algorithm to be
+used to choose the memory to be allocated; the default algorithm can be set
+with :c:func:`gen_pool_set_algo`. The data value is passed to the
+algorithm; most ignore it, but it is occasionally needed. One can,
+naturally, write a special-purpose algorithm, but there is a fair set
+already available:
+
+- gen_pool_first_fit is a simple first-fit allocator; this is the default
+ algorithm if none other has been specified.
+
+- gen_pool_first_fit_align forces the allocation to have a specific
+ alignment (passed via data in a genpool_data_align structure).
+
+- gen_pool_first_fit_order_align aligns the allocation to the order of the
+ size. A 60-byte allocation will thus be 64-byte aligned, for example.
+
+- gen_pool_best_fit, as one would expect, is a simple best-fit allocator.
+
+- gen_pool_fixed_alloc allocates at a specific offset (passed in a
+ genpool_data_fixed structure via the data parameter) within the pool.
+ If the indicated memory is not available the allocation fails.
+
+There is a handful of other functions, mostly for purposes like querying
+the space available in the pool or iterating through chunks of memory.
+Most users, however, should not need much beyond what has been described
+above. With luck, wider awareness of this module will help to prevent the
+writing of special-purpose memory allocators in the future.
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_virt_to_phys
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_for_each_chunk
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: addr_in_gen_pool
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_avail
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_size
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: gen_pool_get
+
+.. kernel-doc:: lib/genalloc.c
+ :functions: of_gen_pool_get
diff --git a/Documentation/core-api/index.rst b/Documentation/core-api/index.rst
index 0606be3a3111..d5bbe035316d 100644
--- a/Documentation/core-api/index.rst
+++ b/Documentation/core-api/index.rst
@@ -20,6 +20,7 @@ Core utilities
genericirq
flexible-arrays
librs
+ genalloc
Interfaces for kernel debugging
===============================
diff --git a/Documentation/core-api/kernel-api.rst b/Documentation/core-api/kernel-api.rst
index 17b00914c6ab..8282099e0cbf 100644
--- a/Documentation/core-api/kernel-api.rst
+++ b/Documentation/core-api/kernel-api.rst
@@ -344,3 +344,52 @@ codecs, and devices with strict requirements for interface clocking.
.. kernel-doc:: include/linux/clk.h
:internal:
+
+Synchronization Primitives
+==========================
+
+Read-Copy Update (RCU)
+----------------------
+
+.. kernel-doc:: include/linux/rcupdate.h
+ :external:
+
+.. kernel-doc:: include/linux/rcupdate_wait.h
+ :external:
+
+.. kernel-doc:: include/linux/rcutree.h
+ :external:
+
+.. kernel-doc:: kernel/rcu/tree.c
+ :external:
+
+.. kernel-doc:: kernel/rcu/tree_plugin.h
+ :external:
+
+.. kernel-doc:: kernel/rcu/tree_exp.h
+ :external:
+
+.. kernel-doc:: kernel/rcu/update.c
+ :external:
+
+.. kernel-doc:: include/linux/srcu.h
+ :external:
+
+.. kernel-doc:: kernel/rcu/srcutree.c
+ :external:
+
+.. kernel-doc:: include/linux/rculist_bl.h
+ :external:
+
+.. kernel-doc:: include/linux/rculist.h
+ :external:
+
+.. kernel-doc:: include/linux/rculist_nulls.h
+ :external:
+
+.. kernel-doc:: include/linux/rcu_sync.h
+ :external:
+
+.. kernel-doc:: kernel/rcu/sync.c
+ :external:
+
diff --git a/Documentation/core-api/workqueue.rst b/Documentation/core-api/workqueue.rst
index ffdec94fbca1..3943b5bfa8cf 100644
--- a/Documentation/core-api/workqueue.rst
+++ b/Documentation/core-api/workqueue.rst
@@ -243,11 +243,15 @@ throttling the number of active work items, specifying '0' is
recommended.
Some users depend on the strict execution ordering of ST wq. The
-combination of ``@max_active`` of 1 and ``WQ_UNBOUND`` is used to
-achieve this behavior. Work items on such wq are always queued to the
-unbound worker-pools and only one work item can be active at any given
+combination of ``@max_active`` of 1 and ``WQ_UNBOUND`` used to
+achieve this behavior. Work items on such wq were always queued to the
+unbound worker-pools and only one work item could be active at any given
time thus achieving the same ordering property as ST wq.
+In the current implementation the above configuration only guarantees
+ST behavior within a given NUMA node. Instead alloc_ordered_queue should
+be used to achieve system wide ST behavior.
+
Example Execution Scenarios
===========================
diff --git a/Documentation/dev-tools/gdb-kernel-debugging.rst b/Documentation/dev-tools/gdb-kernel-debugging.rst
index 5e93c9bc6619..19df79286f00 100644
--- a/Documentation/dev-tools/gdb-kernel-debugging.rst
+++ b/Documentation/dev-tools/gdb-kernel-debugging.rst
@@ -31,11 +31,13 @@ Setup
CONFIG_DEBUG_INFO_REDUCED off. If your architecture supports
CONFIG_FRAME_POINTER, keep it enabled.
-- Install that kernel on the guest.
+- Install that kernel on the guest, turn off KASLR if necessary by adding
+ "nokaslr" to the kernel command line.
Alternatively, QEMU allows to boot the kernel directly using -kernel,
-append, -initrd command line switches. This is generally only useful if
you do not depend on modules. See QEMU documentation for more details on
- this mode.
+ this mode. In this case, you should build the kernel with
+ CONFIG_RANDOMIZE_BASE disabled if the architecture supports KASLR.
- Enable the gdb stub of QEMU/KVM, either
diff --git a/Documentation/dev-tools/kgdb.rst b/Documentation/dev-tools/kgdb.rst
index 75273203a35a..d38be58f872a 100644
--- a/Documentation/dev-tools/kgdb.rst
+++ b/Documentation/dev-tools/kgdb.rst
@@ -348,6 +348,15 @@ default behavior is always set to 0.
- ``echo 1 > /sys/module/debug_core/parameters/kgdbreboot``
- Enter the debugger on reboot notify.
+Kernel parameter: ``nokaslr``
+-----------------------------
+
+If the architecture that you are using enable KASLR by default,
+you should consider turning it off. KASLR randomizes the
+virtual address where the kernel image is mapped and confuse
+gdb which resolve kernel symbol address from symbol table
+of vmlinux.
+
Using kdb
=========
@@ -358,7 +367,7 @@ This is a quick example of how to use kdb.
1. Configure kgdboc at boot using kernel parameters::
- console=ttyS0,115200 kgdboc=ttyS0,115200
+ console=ttyS0,115200 kgdboc=ttyS0,115200 nokaslr
OR
diff --git a/Documentation/devicetree/bindings/arm/coresight.txt b/Documentation/devicetree/bindings/arm/coresight.txt
index fcbae6a5e6c1..15ac8e8dcfdf 100644
--- a/Documentation/devicetree/bindings/arm/coresight.txt
+++ b/Documentation/devicetree/bindings/arm/coresight.txt
@@ -34,8 +34,8 @@ its hardware characteristcs.
- Embedded Trace Macrocell (version 4.x):
"arm,coresight-etm4x", "arm,primecell";
- - Qualcomm Configurable Replicator (version 1.x):
- "qcom,coresight-replicator1x", "arm,primecell";
+ - Coresight programmable Replicator :
+ "arm,coresight-dynamic-replicator", "arm,primecell";
- System Trace Macrocell:
"arm,coresight-stm", "arm,primecell"; [1]
diff --git a/Documentation/devicetree/bindings/arm/pmu.txt b/Documentation/devicetree/bindings/arm/pmu.txt
index 61c8b4620415..13611a8199bb 100644
--- a/Documentation/devicetree/bindings/arm/pmu.txt
+++ b/Documentation/devicetree/bindings/arm/pmu.txt
@@ -9,9 +9,11 @@ Required properties:
- compatible : should be one of
"apm,potenza-pmu"
"arm,armv8-pmuv3"
+ "arm,cortex-a73-pmu"
"arm,cortex-a72-pmu"
"arm,cortex-a57-pmu"
"arm,cortex-a53-pmu"
+ "arm,cortex-a35-pmu"
"arm,cortex-a17-pmu"
"arm,cortex-a15-pmu"
"arm,cortex-a12-pmu"
diff --git a/Documentation/devicetree/bindings/ata/ahci-mtk.txt b/Documentation/devicetree/bindings/ata/ahci-mtk.txt
new file mode 100644
index 000000000000..d2aa696b161b
--- /dev/null
+++ b/Documentation/devicetree/bindings/ata/ahci-mtk.txt
@@ -0,0 +1,51 @@
+MediaTek Serial ATA controller
+
+Required properties:
+ - compatible : Must be "mediatek,<chip>-ahci", "mediatek,mtk-ahci".
+ When using "mediatek,mtk-ahci" compatible strings, you
+ need SoC specific ones in addition, one of:
+ - "mediatek,mt7622-ahci"
+ - reg : Physical base addresses and length of register sets.
+ - interrupts : Interrupt associated with the SATA device.
+ - interrupt-names : Associated name must be: "hostc".
+ - clocks : A list of phandle and clock specifier pairs, one for each
+ entry in clock-names.
+ - clock-names : Associated names must be: "ahb", "axi", "asic", "rbc", "pm".
+ - phys : A phandle and PHY specifier pair for the PHY port.
+ - phy-names : Associated name must be: "sata-phy".
+ - ports-implemented : See ./ahci-platform.txt for details.
+
+Optional properties:
+ - power-domains : A phandle and power domain specifier pair to the power
+ domain which is responsible for collapsing and restoring
+ power to the peripheral.
+ - resets : Must contain an entry for each entry in reset-names.
+ See ../reset/reset.txt for details.
+ - reset-names : Associated names must be: "axi", "sw", "reg".
+ - mediatek,phy-mode : A phandle to the system controller, used to enable
+ SATA function.
+
+Example:
+
+ sata: sata@1a200000 {
+ compatible = "mediatek,mt7622-ahci",
+ "mediatek,mtk-ahci";
+ reg = <0 0x1a200000 0 0x1100>;
+ interrupts = <GIC_SPI 233 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "hostc";
+ clocks = <&pciesys CLK_SATA_AHB_EN>,
+ <&pciesys CLK_SATA_AXI_EN>,
+ <&pciesys CLK_SATA_ASIC_EN>,
+ <&pciesys CLK_SATA_RBC_EN>,
+ <&pciesys CLK_SATA_PM_EN>;
+ clock-names = "ahb", "axi", "asic", "rbc", "pm";
+ phys = <&u3port1 PHY_TYPE_SATA>;
+ phy-names = "sata-phy";
+ ports-implemented = <0x1>;
+ power-domains = <&scpsys MT7622_POWER_DOMAIN_HIF0>;
+ resets = <&pciesys MT7622_SATA_AXI_BUS_RST>,
+ <&pciesys MT7622_SATA_PHY_SW_RST>,
+ <&pciesys MT7622_SATA_PHY_REG_RST>;
+ reset-names = "axi", "sw", "reg";
+ mediatek,phy-mode = <&pciesys>;
+ };
diff --git a/Documentation/devicetree/bindings/clock/mt8173-cpu-dvfs.txt b/Documentation/devicetree/bindings/clock/mt8173-cpu-dvfs.txt
deleted file mode 100644
index 52b457c23eed..000000000000
--- a/Documentation/devicetree/bindings/clock/mt8173-cpu-dvfs.txt
+++ /dev/null
@@ -1,83 +0,0 @@
-Device Tree Clock bindins for CPU DVFS of Mediatek MT8173 SoC
-
-Required properties:
-- clocks: A list of phandle + clock-specifier pairs for the clocks listed in clock names.
-- clock-names: Should contain the following:
- "cpu" - The multiplexer for clock input of CPU cluster.
- "intermediate" - A parent of "cpu" clock which is used as "intermediate" clock
- source (usually MAINPLL) when the original CPU PLL is under
- transition and not stable yet.
- Please refer to Documentation/devicetree/bindings/clk/clock-bindings.txt for
- generic clock consumer properties.
-- proc-supply: Regulator for Vproc of CPU cluster.
-
-Optional properties:
-- sram-supply: Regulator for Vsram of CPU cluster. When present, the cpufreq driver
- needs to do "voltage tracking" to step by step scale up/down Vproc and
- Vsram to fit SoC specific needs. When absent, the voltage scaling
- flow is handled by hardware, hence no software "voltage tracking" is
- needed.
-
-Example:
---------
- cpu0: cpu@0 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x000>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_SLEEP_0>;
- clocks = <&infracfg CLK_INFRA_CA53SEL>,
- <&apmixedsys CLK_APMIXED_MAINPLL>;
- clock-names = "cpu", "intermediate";
- };
-
- cpu1: cpu@1 {
- device_type = "cpu";
- compatible = "arm,cortex-a53";
- reg = <0x001>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_SLEEP_0>;
- clocks = <&infracfg CLK_INFRA_CA53SEL>,
- <&apmixedsys CLK_APMIXED_MAINPLL>;
- clock-names = "cpu", "intermediate";
- };
-
- cpu2: cpu@100 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x100>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_SLEEP_0>;
- clocks = <&infracfg CLK_INFRA_CA57SEL>,
- <&apmixedsys CLK_APMIXED_MAINPLL>;
- clock-names = "cpu", "intermediate";
- };
-
- cpu3: cpu@101 {
- device_type = "cpu";
- compatible = "arm,cortex-a57";
- reg = <0x101>;
- enable-method = "psci";
- cpu-idle-states = <&CPU_SLEEP_0>;
- clocks = <&infracfg CLK_INFRA_CA57SEL>,
- <&apmixedsys CLK_APMIXED_MAINPLL>;
- clock-names = "cpu", "intermediate";
- };
-
- &cpu0 {
- proc-supply = <&mt6397_vpca15_reg>;
- };
-
- &cpu1 {
- proc-supply = <&mt6397_vpca15_reg>;
- };
-
- &cpu2 {
- proc-supply = <&da9211_vcpu_reg>;
- sram-supply = <&mt6397_vsramca7_reg>;
- };
-
- &cpu3 {
- proc-supply = <&da9211_vcpu_reg>;
- sram-supply = <&mt6397_vsramca7_reg>;
- };
diff --git a/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt
new file mode 100644
index 000000000000..f6403089edcf
--- /dev/null
+++ b/Documentation/devicetree/bindings/cpufreq/cpufreq-mediatek.txt
@@ -0,0 +1,247 @@
+Binding for MediaTek's CPUFreq driver
+=====================================
+
+Required properties:
+- clocks: A list of phandle + clock-specifier pairs for the clocks listed in clock names.
+- clock-names: Should contain the following:
+ "cpu" - The multiplexer for clock input of CPU cluster.
+ "intermediate" - A parent of "cpu" clock which is used as "intermediate" clock
+ source (usually MAINPLL) when the original CPU PLL is under
+ transition and not stable yet.
+ Please refer to Documentation/devicetree/bindings/clk/clock-bindings.txt for
+ generic clock consumer properties.
+- operating-points-v2: Please refer to Documentation/devicetree/bindings/opp/opp.txt
+ for detail.
+- proc-supply: Regulator for Vproc of CPU cluster.
+
+Optional properties:
+- sram-supply: Regulator for Vsram of CPU cluster. When present, the cpufreq driver
+ needs to do "voltage tracking" to step by step scale up/down Vproc and
+ Vsram to fit SoC specific needs. When absent, the voltage scaling
+ flow is handled by hardware, hence no software "voltage tracking" is
+ needed.
+- #cooling-cells:
+- cooling-min-level:
+- cooling-max-level:
+ Please refer to Documentation/devicetree/bindings/thermal/thermal.txt
+ for detail.
+
+Example 1 (MT7623 SoC):
+
+ cpu_opp_table: opp_table {
+ compatible = "operating-points-v2";
+ opp-shared;
+
+ opp-598000000 {
+ opp-hz = /bits/ 64 <598000000>;
+ opp-microvolt = <1050000>;
+ };
+
+ opp-747500000 {
+ opp-hz = /bits/ 64 <747500000>;
+ opp-microvolt = <1050000>;
+ };
+
+ opp-1040000000 {
+ opp-hz = /bits/ 64 <1040000000>;
+ opp-microvolt = <1150000>;
+ };
+
+ opp-1196000000 {
+ opp-hz = /bits/ 64 <1196000000>;
+ opp-microvolt = <1200000>;
+ };
+
+ opp-1300000000 {
+ opp-hz = /bits/ 64 <1300000000>;
+ opp-microvolt = <1300000>;
+ };
+ };
+
+ cpu0: cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x0>;
+ clocks = <&infracfg CLK_INFRA_CPUSEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table>;
+ #cooling-cells = <2>;
+ cooling-min-level = <0>;
+ cooling-max-level = <7>;
+ };
+ cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x1>;
+ operating-points-v2 = <&cpu_opp_table>;
+ };
+ cpu@2 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x2>;
+ operating-points-v2 = <&cpu_opp_table>;
+ };
+ cpu@3 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a7";
+ reg = <0x3>;
+ operating-points-v2 = <&cpu_opp_table>;
+ };
+
+Example 2 (MT8173 SoC):
+ cpu_opp_table_a: opp_table_a {
+ compatible = "operating-points-v2";
+ opp-shared;
+
+ opp-507000000 {
+ opp-hz = /bits/ 64 <507000000>;
+ opp-microvolt = <859000>;
+ };
+
+ opp-702000000 {
+ opp-hz = /bits/ 64 <702000000>;
+ opp-microvolt = <908000>;
+ };
+
+ opp-1001000000 {
+ opp-hz = /bits/ 64 <1001000000>;
+ opp-microvolt = <983000>;
+ };
+
+ opp-1105000000 {
+ opp-hz = /bits/ 64 <1105000000>;
+ opp-microvolt = <1009000>;
+ };
+
+ opp-1183000000 {
+ opp-hz = /bits/ 64 <1183000000>;
+ opp-microvolt = <1028000>;
+ };
+
+ opp-1404000000 {
+ opp-hz = /bits/ 64 <1404000000>;
+ opp-microvolt = <1083000>;
+ };
+
+ opp-1508000000 {
+ opp-hz = /bits/ 64 <1508000000>;
+ opp-microvolt = <1109000>;
+ };
+
+ opp-1573000000 {
+ opp-hz = /bits/ 64 <1573000000>;
+ opp-microvolt = <1125000>;
+ };
+ };
+
+ cpu_opp_table_b: opp_table_b {
+ compatible = "operating-points-v2";
+ opp-shared;
+
+ opp-507000000 {
+ opp-hz = /bits/ 64 <507000000>;
+ opp-microvolt = <828000>;
+ };
+
+ opp-702000000 {
+ opp-hz = /bits/ 64 <702000000>;
+ opp-microvolt = <867000>;
+ };
+
+ opp-1001000000 {
+ opp-hz = /bits/ 64 <1001000000>;
+ opp-microvolt = <927000>;
+ };
+
+ opp-1209000000 {
+ opp-hz = /bits/ 64 <1209000000>;
+ opp-microvolt = <968000>;
+ };
+
+ opp-1404000000 {
+ opp-hz = /bits/ 64 <1007000000>;
+ opp-microvolt = <1028000>;
+ };
+
+ opp-1612000000 {
+ opp-hz = /bits/ 64 <1612000000>;
+ opp-microvolt = <1049000>;
+ };
+
+ opp-1807000000 {
+ opp-hz = /bits/ 64 <1807000000>;
+ opp-microvolt = <1089000>;
+ };
+
+ opp-1989000000 {
+ opp-hz = /bits/ 64 <1989000000>;
+ opp-microvolt = <1125000>;
+ };
+ };
+
+ cpu0: cpu@0 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x000>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ clocks = <&infracfg CLK_INFRA_CA53SEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table_a>;
+ };
+
+ cpu1: cpu@1 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a53";
+ reg = <0x001>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ clocks = <&infracfg CLK_INFRA_CA53SEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table_a>;
+ };
+
+ cpu2: cpu@100 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x100>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ clocks = <&infracfg CLK_INFRA_CA57SEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table_b>;
+ };
+
+ cpu3: cpu@101 {
+ device_type = "cpu";
+ compatible = "arm,cortex-a57";
+ reg = <0x101>;
+ enable-method = "psci";
+ cpu-idle-states = <&CPU_SLEEP_0>;
+ clocks = <&infracfg CLK_INFRA_CA57SEL>,
+ <&apmixedsys CLK_APMIXED_MAINPLL>;
+ clock-names = "cpu", "intermediate";
+ operating-points-v2 = <&cpu_opp_table_b>;
+ };
+
+ &cpu0 {
+ proc-supply = <&mt6397_vpca15_reg>;
+ };
+
+ &cpu1 {
+ proc-supply = <&mt6397_vpca15_reg>;
+ };
+
+ &cpu2 {
+ proc-supply = <&da9211_vcpu_reg>;
+ sram-supply = <&mt6397_vsramca7_reg>;
+ };
+
+ &cpu3 {
+ proc-supply = <&da9211_vcpu_reg>;
+ sram-supply = <&mt6397_vsramca7_reg>;
+ };
diff --git a/Documentation/devicetree/bindings/crypto/artpec6-crypto.txt b/Documentation/devicetree/bindings/crypto/artpec6-crypto.txt
new file mode 100644
index 000000000000..d9cca4875bd6
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/artpec6-crypto.txt
@@ -0,0 +1,16 @@
+Axis crypto engine with PDMA interface.
+
+Required properties:
+- compatible : Should be one of the following strings:
+ "axis,artpec6-crypto" for the version in the Axis ARTPEC-6 SoC
+ "axis,artpec7-crypto" for the version in the Axis ARTPEC-7 SoC.
+- reg: Base address and size for the PDMA register area.
+- interrupts: Interrupt handle for the PDMA interrupt line.
+
+Example:
+
+crypto@f4264000 {
+ compatible = "axis,artpec6-crypto";
+ reg = <0xf4264000 0x1000>;
+ interrupts = <GIC_SPI 19 IRQ_TYPE_LEVEL_HIGH>;
+};
diff --git a/Documentation/devicetree/bindings/crypto/atmel-crypto.txt b/Documentation/devicetree/bindings/crypto/atmel-crypto.txt
index f2aab3dc2b52..7de1a9674c70 100644
--- a/Documentation/devicetree/bindings/crypto/atmel-crypto.txt
+++ b/Documentation/devicetree/bindings/crypto/atmel-crypto.txt
@@ -66,3 +66,16 @@ sha@f8034000 {
dmas = <&dma1 2 17>;
dma-names = "tx";
};
+
+* Eliptic Curve Cryptography (I2C)
+
+Required properties:
+- compatible : must be "atmel,atecc508a".
+- reg: I2C bus address of the device.
+- clock-frequency: must be present in the i2c controller node.
+
+Example:
+atecc508a@C0 {
+ compatible = "atmel,atecc508a";
+ reg = <0xC0>;
+};
diff --git a/Documentation/devicetree/bindings/crypto/st,stm32-hash.txt b/Documentation/devicetree/bindings/crypto/st,stm32-hash.txt
new file mode 100644
index 000000000000..04fc246f02f7
--- /dev/null
+++ b/Documentation/devicetree/bindings/crypto/st,stm32-hash.txt
@@ -0,0 +1,30 @@
+* STMicroelectronics STM32 HASH
+
+Required properties:
+- compatible: Should contain entries for this and backward compatible
+ HASH versions:
+ - "st,stm32f456-hash" for stm32 F456.
+ - "st,stm32f756-hash" for stm32 F756.
+- reg: The address and length of the peripheral registers space
+- interrupts: the interrupt specifier for the HASH
+- clocks: The input clock of the HASH instance
+
+Optional properties:
+- resets: The input reset of the HASH instance
+- dmas: DMA specifiers for the HASH. See the DMA client binding,
+ Documentation/devicetree/bindings/dma/dma.txt
+- dma-names: DMA request name. Should be "in" if a dma is present.
+- dma-maxburst: Set number of maximum dma burst supported
+
+Example:
+
+hash1: hash@50060400 {
+ compatible = "st,stm32f756-hash";
+ reg = <0x50060400 0x400>;
+ interrupts = <80>;
+ clocks = <&rcc 0 STM32F7_AHB2_CLOCK(HASH)>;
+ resets = <&rcc STM32F7_AHB2_RESET(HASH)>;
+ dmas = <&dma2 7 2 0x400 0x0>;
+ dma-names = "in";
+ dma-maxburst = <0>;
+};
diff --git a/Documentation/devicetree/bindings/display/bridge/dw_mipi_dsi.txt b/Documentation/devicetree/bindings/display/bridge/dw_mipi_dsi.txt
new file mode 100644
index 000000000000..b13adf30b8d3
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/bridge/dw_mipi_dsi.txt
@@ -0,0 +1,32 @@
+Synopsys DesignWare MIPI DSI host controller
+============================================
+
+This document defines device tree properties for the Synopsys DesignWare MIPI
+DSI host controller. It doesn't constitue a device tree binding specification
+by itself but is meant to be referenced by platform-specific device tree
+bindings.
+
+When referenced from platform device tree bindings the properties defined in
+this document are defined as follows. The platform device tree bindings are
+responsible for defining whether each optional property is used or not.
+
+- reg: Memory mapped base address and length of the DesignWare MIPI DSI
+ host controller registers. (mandatory)
+
+- clocks: References to all the clocks specified in the clock-names property
+ as specified in [1]. (mandatory)
+
+- clock-names:
+ - "pclk" is the peripheral clock for either AHB and APB. (mandatory)
+ - "px_clk" is the pixel clock for the DPI/RGB input. (optional)
+
+- resets: References to all the resets specified in the reset-names property
+ as specified in [2]. (optional)
+
+- reset-names: string reset name, must be "apb" if used. (optional)
+
+- panel or bridge node: see [3]. (mandatory)
+
+[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
+[2] Documentation/devicetree/bindings/reset/reset.txt
+[3] Documentation/devicetree/bindings/display/mipi-dsi-bus.txt
diff --git a/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt b/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt
index 549c538b38a5..fc2588292a68 100644
--- a/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt
+++ b/Documentation/devicetree/bindings/display/exynos/exynos5433-decon.txt
@@ -25,12 +25,6 @@ Required properties:
size-cells must 1 and 0, respectively.
- port: contains an endpoint node which is connected to the endpoint in the mic
node. The reg value muset be 0.
-- i80-if-timings: specify whether the panel which is connected to decon uses
- i80 lcd interface or mipi video interface. This node contains
- no timing information as that of fimd does. Because there is
- no register in decon to specify i80 interface timing value,
- it is not needed, but make it remain to use same kind of node
- in fimd and exynos7 decon.
Example:
SoC specific DT entry:
@@ -59,9 +53,3 @@ decon: decon@13800000 {
};
};
};
-
-Board specific DT entry:
-&decon {
- i80-if-timings {
- };
-};
diff --git a/Documentation/devicetree/bindings/display/repaper.txt b/Documentation/devicetree/bindings/display/repaper.txt
new file mode 100644
index 000000000000..f5f9f9cf6a25
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/repaper.txt
@@ -0,0 +1,52 @@
+Pervasive Displays RePaper branded e-ink displays
+
+Required properties:
+- compatible: "pervasive,e1144cs021" for 1.44" display
+ "pervasive,e1190cs021" for 1.9" display
+ "pervasive,e2200cs021" for 2.0" display
+ "pervasive,e2271cs021" for 2.7" display
+
+- panel-on-gpios: Timing controller power control
+- discharge-gpios: Discharge control
+- reset-gpios: RESET pin
+- busy-gpios: BUSY pin
+
+Required property for e2271cs021:
+- border-gpios: Border control
+
+The node for this driver must be a child node of a SPI controller, hence
+all mandatory properties described in ../spi/spi-bus.txt must be specified.
+
+Optional property:
+- pervasive,thermal-zone: name of thermometer's thermal zone
+
+Example:
+
+ display_temp: lm75@48 {
+ compatible = "lm75b";
+ reg = <0x48>;
+ #thermal-sensor-cells = <0>;
+ };
+
+ thermal-zones {
+ display {
+ polling-delay-passive = <0>;
+ polling-delay = <0>;
+ thermal-sensors = <&display_temp>;
+ };
+ };
+
+ papirus27@0{
+ compatible = "pervasive,e2271cs021";
+ reg = <0>;
+
+ spi-max-frequency = <8000000>;
+
+ panel-on-gpios = <&gpio 23 0>;
+ border-gpios = <&gpio 14 0>;
+ discharge-gpios = <&gpio 15 0>;
+ reset-gpios = <&gpio 24 0>;
+ busy-gpios = <&gpio 25 0>;
+
+ pervasive,thermal-zone = "display";
+ };
diff --git a/Documentation/devicetree/bindings/display/rockchip/dw_hdmi-rockchip.txt b/Documentation/devicetree/bindings/display/rockchip/dw_hdmi-rockchip.txt
index 046076c6b277..fad8b7619647 100644
--- a/Documentation/devicetree/bindings/display/rockchip/dw_hdmi-rockchip.txt
+++ b/Documentation/devicetree/bindings/display/rockchip/dw_hdmi-rockchip.txt
@@ -11,7 +11,9 @@ following device-specific properties.
Required properties:
-- compatible: Shall contain "rockchip,rk3288-dw-hdmi".
+- compatible: should be one of the following:
+ "rockchip,rk3288-dw-hdmi"
+ "rockchip,rk3399-dw-hdmi"
- reg: See dw_hdmi.txt.
- reg-io-width: See dw_hdmi.txt. Shall be 4.
- interrupts: HDMI interrupt number
@@ -30,7 +32,8 @@ Optional properties
I2C master controller.
- clock-names: See dw_hdmi.txt. The "cec" clock is optional.
- clock-names: May contain "cec" as defined in dw_hdmi.txt.
-
+- clock-names: May contain "grf", power for grf io.
+- clock-names: May contain "vpll", external clock for some hdmi phy.
Example:
diff --git a/Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt b/Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt
index 9eb3f0a2a078..5d835d9c1ba8 100644
--- a/Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt
+++ b/Documentation/devicetree/bindings/display/rockchip/rockchip-vop.txt
@@ -8,8 +8,12 @@ Required properties:
- compatible: value should be one of the following
"rockchip,rk3036-vop";
"rockchip,rk3288-vop";
+ "rockchip,rk3368-vop";
+ "rockchip,rk3366-vop";
"rockchip,rk3399-vop-big";
"rockchip,rk3399-vop-lit";
+ "rockchip,rk3228-vop";
+ "rockchip,rk3328-vop";
- interrupts: should contain a list of all VOP IP block interrupts in the
order: VSYNC, LCD_SYSTEM. The interrupt specifier
diff --git a/Documentation/devicetree/bindings/display/sitronix,st7586.txt b/Documentation/devicetree/bindings/display/sitronix,st7586.txt
new file mode 100644
index 000000000000..1d0dad1210d3
--- /dev/null
+++ b/Documentation/devicetree/bindings/display/sitronix,st7586.txt
@@ -0,0 +1,22 @@
+Sitronix ST7586 display panel
+
+Required properties:
+- compatible: "lego,ev3-lcd".
+- a0-gpios: The A0 signal (since this binding is for serial mode, this is
+ the pin labeled D1 on the controller, not the pin labeled A0)
+- reset-gpios: Reset pin
+
+The node for this driver must be a child node of a SPI controller, hence
+all mandatory properties described in ../spi/spi-bus.txt must be specified.
+
+Optional properties:
+- rotation: panel rotation in degrees counter clockwise (0,90,180,270)
+
+Example:
+ display@0{
+ compatible = "lego,ev3-lcd";
+ reg = <0>;
+ spi-max-frequency = <10000000>;
+ a0-gpios = <&gpio 43 GPIO_ACTIVE_HIGH>;
+ reset-gpios = <&gpio 80 GPIO_ACTIVE_HIGH>;
+ };
diff --git a/Documentation/devicetree/bindings/display/st,stm32-ltdc.txt b/Documentation/devicetree/bindings/display/st,stm32-ltdc.txt
index 8e1476941c0f..74b5ac7b26d6 100644
--- a/Documentation/devicetree/bindings/display/st,stm32-ltdc.txt
+++ b/Documentation/devicetree/bindings/display/st,stm32-ltdc.txt
@@ -1,7 +1,6 @@
* STMicroelectronics STM32 lcd-tft display controller
- ltdc: lcd-tft display controller host
- must be a sub-node of st-display-subsystem
Required properties:
- compatible: "st,stm32-ltdc"
- reg: Physical base address of the IP registers and length of memory mapped region.
@@ -13,8 +12,40 @@
Required nodes:
- Video port for RGB output.
-Example:
+* STMicroelectronics STM32 DSI controller specific extensions to Synopsys
+ DesignWare MIPI DSI host controller
+The STMicroelectronics STM32 DSI controller uses the Synopsys DesignWare MIPI
+DSI host controller. For all mandatory properties & nodes, please refer
+to the related documentation in [5].
+
+Mandatory properties specific to STM32 DSI:
+- #address-cells: Should be <1>.
+- #size-cells: Should be <0>.
+- compatible: "st,stm32-dsi".
+- clock-names:
+ - phy pll reference clock string name, must be "ref".
+- resets: see [5].
+- reset-names: see [5].
+
+Mandatory nodes specific to STM32 DSI:
+- ports: A node containing DSI input & output port nodes with endpoint
+ definitions as documented in [3] & [4].
+ - port@0: DSI input port node, connected to the ltdc rgb output port.
+ - port@1: DSI output port node, connected to a panel or a bridge input port.
+- panel or bridge node: A node containing the panel or bridge description as
+ documented in [6].
+ - port: panel or bridge port node, connected to the DSI output port (port@1).
+
+Note: You can find more documentation in the following references
+[1] Documentation/devicetree/bindings/clock/clock-bindings.txt
+[2] Documentation/devicetree/bindings/reset/reset.txt
+[3] Documentation/devicetree/bindings/media/video-interfaces.txt
+[4] Documentation/devicetree/bindings/graph.txt
+[5] Documentation/devicetree/bindings/display/bridge/dw_mipi_dsi.txt
+[6] Documentation/devicetree/bindings/display/mipi-dsi-bus.txt
+
+Example 1: RGB panel
/ {
...
soc {
@@ -34,3 +65,73 @@ Example:
};
};
};
+
+Example 2: DSI panel
+
+/ {
+ ...
+ soc {
+ ...
+ ltdc: display-controller@40016800 {
+ compatible = "st,stm32-ltdc";
+ reg = <0x40016800 0x200>;
+ interrupts = <88>, <89>;
+ resets = <&rcc STM32F4_APB2_RESET(LTDC)>;
+ clocks = <&rcc 1 CLK_LCD>;
+ clock-names = "lcd";
+
+ port {
+ ltdc_out_dsi: endpoint {
+ remote-endpoint = <&dsi_in>;
+ };
+ };
+ };
+
+
+ dsi: dsi@40016c00 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "st,stm32-dsi";
+ reg = <0x40016c00 0x800>;
+ clocks = <&rcc 1 CLK_F469_DSI>, <&clk_hse>;
+ clock-names = "ref", "pclk";
+ resets = <&rcc STM32F4_APB2_RESET(DSI)>;
+ reset-names = "apb";
+
+ ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ port@0 {
+ reg = <0>;
+ dsi_in: endpoint {
+ remote-endpoint = <&ltdc_out_dsi>;
+ };
+ };
+
+ port@1 {
+ reg = <1>;
+ dsi_out: endpoint {
+ remote-endpoint = <&dsi_in_panel>;
+ };
+ };
+
+ };
+
+ panel-dsi@0 {
+ reg = <0>; /* dsi virtual channel (0..3) */
+ compatible = ...;
+ enable-gpios = ...;
+
+ port {
+ dsi_in_panel: endpoint {
+ remote-endpoint = <&dsi_out>;
+ };
+ };
+
+ };
+
+ };
+
+ };
+};
diff --git a/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt b/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
index b83e6018041d..2ee6ff0ef98e 100644
--- a/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
+++ b/Documentation/devicetree/bindings/display/sunxi/sun4i-drm.txt
@@ -4,15 +4,33 @@ Allwinner A10 Display Pipeline
The Allwinner A10 Display pipeline is composed of several components
that are going to be documented below:
-For the input port of all components up to the TCON in the display
-pipeline, if there are multiple components, the local endpoint IDs
-must correspond to the index of the upstream block. For example, if
-the remote endpoint is Frontend 1, then the local endpoint ID must
-be 1.
-
-Conversely, for the output ports of the same group, the remote endpoint
-ID must be the index of the local hardware block. If the local backend
-is backend 1, then the remote endpoint ID must be 1.
+For all connections between components up to the TCONs in the display
+pipeline, when there are multiple components of the same type at the
+same depth, the local endpoint ID must be the same as the remote
+component's index. For example, if the remote endpoint is Frontend 1,
+then the local endpoint ID must be 1.
+
+ Frontend 0 [0] ------- [0] Backend 0 [0] ------- [0] TCON 0
+ [1] -- -- [1] [1] -- -- [1]
+ \ / \ /
+ X X
+ / \ / \
+ [0] -- -- [0] [0] -- -- [0]
+ Frontend 1 [1] ------- [1] Backend 1 [1] ------- [1] TCON 1
+
+For a two pipeline system such as the one depicted above, the lines
+represent the connections between the components, while the numbers
+within the square brackets corresponds to the ID of the local endpoint.
+
+The same rule also applies to DE 2.0 mixer-TCON connections:
+
+ Mixer 0 [0] ----------- [0] TCON 0
+ [1] ---- ---- [1]
+ \ /
+ X
+ / \
+ [0] ---- ---- [0]
+ Mixer 1 [1] ----------- [1] TCON 1
HDMI Encoder
------------
diff --git a/Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.txt b/Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.txt
new file mode 100644
index 000000000000..8e8625c00dfa
--- /dev/null
+++ b/Documentation/devicetree/bindings/extcon/extcon-usbc-cros-ec.txt
@@ -0,0 +1,24 @@
+ChromeOS EC USB Type-C cable and accessories detection
+
+On ChromeOS systems with USB Type C ports, the ChromeOS Embedded Controller is
+able to detect the state of external accessories such as display adapters
+or USB devices when said accessories are attached or detached.
+
+The node for this device must be under a cros-ec node like google,cros-ec-spi
+or google,cros-ec-i2c.
+
+Required properties:
+- compatible: Should be "google,extcon-usbc-cros-ec".
+- google,usb-port-id: Specifies the USB port ID to use.
+
+Example:
+ cros-ec@0 {
+ compatible = "google,cros-ec-i2c";
+
+ ...
+
+ extcon {
+ compatible = "google,extcon-usbc-cros-ec";
+ google,usb-port-id = <0>;
+ };
+ }
diff --git a/Documentation/devicetree/bindings/fpga/altera-passive-serial.txt b/Documentation/devicetree/bindings/fpga/altera-passive-serial.txt
new file mode 100644
index 000000000000..48478bc07e29
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/altera-passive-serial.txt
@@ -0,0 +1,29 @@
+Altera Passive Serial SPI FPGA Manager
+
+Altera FPGAs support a method of loading the bitstream over what is
+referred to as "passive serial".
+The passive serial link is not technically SPI, and might require extra
+circuits in order to play nicely with other SPI slaves on the same bus.
+
+See https://www.altera.com/literature/hb/cyc/cyc_c51013.pdf
+
+Required properties:
+- compatible: Must be one of the following:
+ "altr,fpga-passive-serial",
+ "altr,fpga-arria10-passive-serial"
+- reg: SPI chip select of the FPGA
+- nconfig-gpios: config pin (referred to as nCONFIG in the manual)
+- nstat-gpios: status pin (referred to as nSTATUS in the manual)
+
+Optional properties:
+- confd-gpios: confd pin (referred to as CONF_DONE in the manual)
+
+Example:
+ fpga: fpga@0 {
+ compatible = "altr,fpga-passive-serial";
+ spi-max-frequency = <20000000>;
+ reg = <0>;
+ nconfig-gpios = <&gpio4 9 GPIO_ACTIVE_LOW>;
+ nstat-gpios = <&gpio4 11 GPIO_ACTIVE_LOW>;
+ confd-gpios = <&gpio4 12 GPIO_ACTIVE_LOW>;
+ };
diff --git a/Documentation/devicetree/bindings/fpga/xilinx-pr-decoupler.txt b/Documentation/devicetree/bindings/fpga/xilinx-pr-decoupler.txt
new file mode 100644
index 000000000000..8dcfba926bc7
--- /dev/null
+++ b/Documentation/devicetree/bindings/fpga/xilinx-pr-decoupler.txt
@@ -0,0 +1,36 @@
+Xilinx LogiCORE Partial Reconfig Decoupler Softcore
+
+The Xilinx LogiCORE Partial Reconfig Decoupler manages one or more
+decouplers / fpga bridges.
+The controller can decouple/disable the bridges which prevents signal
+changes from passing through the bridge. The controller can also
+couple / enable the bridges which allows traffic to pass through the
+bridge normally.
+
+The Driver supports only MMIO handling. A PR region can have multiple
+PR Decouplers which can be handled independently or chained via decouple/
+decouple_status signals.
+
+Required properties:
+- compatible : Should contain "xlnx,pr-decoupler-1.00" followed by
+ "xlnx,pr-decoupler"
+- regs : base address and size for decoupler module
+- clocks : input clock to IP
+- clock-names : should contain "aclk"
+
+Optional properties:
+- bridge-enable : 0 if driver should disable bridge at startup
+ 1 if driver should enable bridge at startup
+ Default is to leave bridge in current state.
+
+See Documentation/devicetree/bindings/fpga/fpga-region.txt for generic bindings.
+
+Example:
+ fpga-bridge@100000450 {
+ compatible = "xlnx,pr-decoupler-1.00",
+ "xlnx-pr-decoupler";
+ regs = <0x10000045 0x10>;
+ clocks = <&clkc 15>;
+ clock-names = "aclk";
+ bridge-enable = <0>;
+ };
diff --git a/Documentation/devicetree/bindings/gpio/gpio-74x164.txt b/Documentation/devicetree/bindings/gpio/gpio-74x164.txt
index ce1b2231bf5d..2a97553d8d76 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-74x164.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio-74x164.txt
@@ -12,6 +12,9 @@ Required properties:
1 = active low
- registers-number: Number of daisy-chained shift registers
+Optional properties:
+- enable-gpios: GPIO connected to the OE (Output Enable) pin.
+
Example:
gpio5: gpio5@0 {
diff --git a/Documentation/devicetree/bindings/gpio/gpio-aspeed.txt b/Documentation/devicetree/bindings/gpio/gpio-aspeed.txt
index c756afa88cc6..fc6378c778c5 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-aspeed.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio-aspeed.txt
@@ -18,7 +18,7 @@ Required properties:
Optional properties:
- interrupt-parent : The parent interrupt controller, optional if inherited
-- clocks : A phandle to the HPLL clock node for debounce timings
+- clocks : A phandle to the clock to use for debounce timings
The gpio and interrupt properties are further described in their respective
bindings documentation:
diff --git a/Documentation/devicetree/bindings/gpio/gpio-davinci.txt b/Documentation/devicetree/bindings/gpio/gpio-davinci.txt
index 5079ba7d6568..8beb0539b6d8 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-davinci.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio-davinci.txt
@@ -1,7 +1,10 @@
Davinci/Keystone GPIO controller bindings
Required Properties:
-- compatible: should be "ti,dm6441-gpio", "ti,keystone-gpio"
+- compatible: should be "ti,dm6441-gpio": for Davinci da850 SoCs
+ "ti,keystone-gpio": for Keystone 2 66AK2H/K, 66AK2L,
+ 66AK2E SoCs
+ "ti,k2g-gpio", "ti,keystone-gpio": for 66AK2G
- reg: Physical base address of the controller and the size of memory mapped
registers.
@@ -20,7 +23,21 @@ Required Properties:
- ti,ngpio: The number of GPIO pins supported.
- ti,davinci-gpio-unbanked: The number of GPIOs that have an individual interrupt
- line to processor.
+ line to processor.
+
+- clocks: Should contain the device's input clock, and should be defined as per
+ the appropriate clock bindings consumer usage in,
+
+ Documentation/devicetree/bindings/clock/keystone-gate.txt
+ for 66AK2HK/66AK2L/66AK2E SoCs or,
+
+ Documentation/devicetree/bindings/clock/ti,sci-clk.txt
+ for 66AK2G SoCs
+
+- clock-names: Name should be "gpio";
+
+Currently clock-names and clocks are needed for all keystone 2 platforms
+Davinci platforms do not have DT clocks as of now.
The GPIO controller also acts as an interrupt controller. It uses the default
two cells specifier as described in Documentation/devicetree/bindings/
@@ -60,3 +77,73 @@ leds {
...
};
};
+
+Example for 66AK2G:
+
+gpio0: gpio@2603000 {
+ compatible = "ti,k2g-gpio", "ti,keystone-gpio";
+ reg = <0x02603000 0x100>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ interrupts = <GIC_SPI 432 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 433 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 434 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 435 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 436 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 437 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 438 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 439 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 440 IRQ_TYPE_EDGE_RISING>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ ti,ngpio = <144>;
+ ti,davinci-gpio-unbanked = <0>;
+ clocks = <&k2g_clks 0x001b 0x0>;
+ clock-names = "gpio";
+};
+
+Example for 66AK2HK/66AK2L/66AK2E:
+
+gpio0: gpio@260bf00 {
+ compatible = "ti,keystone-gpio";
+ reg = <0x0260bf00 0x100>;
+ gpio-controller;
+ #gpio-cells = <2>;
+ /* HW Interrupts mapped to GPIO pins */
+ interrupts = <GIC_SPI 120 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 121 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 122 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 123 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 124 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 125 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 126 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 127 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 128 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 129 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 130 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 131 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 132 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 133 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 134 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 135 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 136 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 137 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 138 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 139 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 140 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 141 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 142 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 143 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 144 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 145 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 146 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 147 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 148 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 149 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 150 IRQ_TYPE_EDGE_RISING>,
+ <GIC_SPI 151 IRQ_TYPE_EDGE_RISING>;
+ clocks = <&clkgpio>;
+ clock-names = "gpio";
+ ti,ngpio = <32>;
+ ti,davinci-gpio-unbanked = <32>;
+};
diff --git a/Documentation/devicetree/bindings/gpio/gpio-vf610.txt b/Documentation/devicetree/bindings/gpio/gpio-vf610.txt
index 436cc99c6598..0ccbae44019c 100644
--- a/Documentation/devicetree/bindings/gpio/gpio-vf610.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio-vf610.txt
@@ -5,7 +5,9 @@ functionality. Each pair serves 32 GPIOs. The VF610 has 5 instances of
each, and each PORT module has its own interrupt.
Required properties for GPIO node:
-- compatible : Should be "fsl,<soc>-gpio", currently "fsl,vf610-gpio"
+- compatible : Should be "fsl,<soc>-gpio", below is supported list:
+ "fsl,vf610-gpio"
+ "fsl,imx7ulp-gpio"
- reg : The first reg tuple represents the PORT module, the second tuple
the GPIO module.
- interrupts : Should be the port interrupt shared by all 32 pins.
diff --git a/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt b/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt
index 6826a371fb69..51c86f69995e 100644
--- a/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt
+++ b/Documentation/devicetree/bindings/gpio/renesas,gpio-rcar.txt
@@ -2,8 +2,9 @@
Required Properties:
- - compatible: should contain one of the following.
+ - compatible: should contain one or more of the following:
- "renesas,gpio-r8a7743": for R8A7743 (RZ/G1M) compatible GPIO controller.
+ - "renesas,gpio-r8a7745": for R8A7745 (RZ/G1E) compatible GPIO controller.
- "renesas,gpio-r8a7778": for R8A7778 (R-Mobile M1) compatible GPIO controller.
- "renesas,gpio-r8a7779": for R8A7779 (R-Car H1) compatible GPIO controller.
- "renesas,gpio-r8a7790": for R8A7790 (R-Car H2) compatible GPIO controller.
@@ -13,7 +14,14 @@ Required Properties:
- "renesas,gpio-r8a7794": for R8A7794 (R-Car E2) compatible GPIO controller.
- "renesas,gpio-r8a7795": for R8A7795 (R-Car H3) compatible GPIO controller.
- "renesas,gpio-r8a7796": for R8A7796 (R-Car M3-W) compatible GPIO controller.
- - "renesas,gpio-rcar": for generic R-Car GPIO controller.
+ - "renesas,rcar-gen1-gpio": for a generic R-Car Gen1 GPIO controller.
+ - "renesas,rcar-gen2-gpio": for a generic R-Car Gen2 or RZ/G1 GPIO controller.
+ - "renesas,rcar-gen3-gpio": for a generic R-Car Gen3 GPIO controller.
+ - "renesas,gpio-rcar": deprecated.
+
+ When compatible with the generic version nodes must list the
+ SoC-specific version corresponding to the platform first followed by
+ the generic version.
- reg: Base address and length of each memory resource used by the GPIO
controller hardware module.
@@ -43,7 +51,7 @@ interrupt-controller/interrupts.txt.
Example: R8A7779 (R-Car H1) GPIO controller nodes
gpio0: gpio@ffc40000 {
- compatible = "renesas,gpio-r8a7779", "renesas,gpio-rcar";
+ compatible = "renesas,gpio-r8a7779", "renesas,rcar-gen1-gpio";
reg = <0xffc40000 0x2c>;
interrupt-parent = <&gic>;
interrupts = <0 141 0x4>;
@@ -55,7 +63,7 @@ Example: R8A7779 (R-Car H1) GPIO controller nodes
};
...
gpio6: gpio@ffc46000 {
- compatible = "renesas,gpio-r8a7779", "renesas,gpio-rcar";
+ compatible = "renesas,gpio-r8a7779", "renesas,rcar-gen1-gpio";
reg = <0xffc46000 0x2c>;
interrupt-parent = <&gic>;
interrupts = <0 147 0x4>;
diff --git a/Documentation/devicetree/bindings/hwmon/aspeed-pwm-tacho.txt b/Documentation/devicetree/bindings/hwmon/aspeed-pwm-tacho.txt
index cf4460564adb..367c8203213b 100644
--- a/Documentation/devicetree/bindings/hwmon/aspeed-pwm-tacho.txt
+++ b/Documentation/devicetree/bindings/hwmon/aspeed-pwm-tacho.txt
@@ -11,6 +11,8 @@ Required properties for pwm-tacho node:
- #size-cells : should be 1.
+- #cooling-cells: should be 2.
+
- reg : address and length of the register set for the device.
- pinctrl-names : a pinctrl state named "default" must be defined.
@@ -28,12 +30,17 @@ fan subnode format:
Under fan subnode there can upto 8 child nodes, with each child node
representing a fan. If there are 8 fans each fan can have one PWM port and
one/two Fan tach inputs.
+For PWM port can be configured cooling-levels to create cooling device.
+Cooling device could be bound to a thermal zone for the thermal control.
Required properties for each child node:
- reg : should specify PWM source port.
integer value in the range 0 to 7 with 0 indicating PWM port A and
7 indicating PWM port H.
+- cooling-levels: PWM duty cycle values in a range from 0 to 255
+ which correspond to thermal cooling states.
+
- aspeed,fan-tach-ch : should specify the Fan tach input channel.
integer value in the range 0 through 15, with 0 indicating
Fan tach channel 0 and 15 indicating Fan tach channel 15.
@@ -50,6 +57,7 @@ pwm_tacho_fixed_clk: fixedclk {
pwm_tacho: pwmtachocontroller@1e786000 {
#address-cells = <1>;
#size-cells = <1>;
+ #cooling-cells = <2>;
reg = <0x1E786000 0x1000>;
compatible = "aspeed,ast2500-pwm-tacho";
clocks = <&pwm_tacho_fixed_clk>;
@@ -58,6 +66,7 @@ pwm_tacho: pwmtachocontroller@1e786000 {
fan@0 {
reg = <0x00>;
+ cooling-levels = /bits/ 8 <125 151 177 203 229 255>;
aspeed,fan-tach-ch = /bits/ 8 <0x00>;
};
diff --git a/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt b/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt
new file mode 100644
index 000000000000..f68a0a68fc52
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ibm,cffps1.txt
@@ -0,0 +1,21 @@
+Device-tree bindings for IBM Common Form Factor Power Supply Version 1
+----------------------------------------------------------------------
+
+Required properties:
+ - compatible = "ibm,cffps1";
+ - reg = < I2C bus address >; : Address of the power supply on the
+ I2C bus.
+
+Example:
+
+ i2c-bus@100 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ #interrupt-cells = <1>;
+ < more properties >
+
+ power-supply@68 {
+ compatible = "ibm,cffps1";
+ reg = <0x68>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/hwmon/ltq-cputemp.txt b/Documentation/devicetree/bindings/hwmon/ltq-cputemp.txt
new file mode 100644
index 000000000000..33fd00a987c7
--- /dev/null
+++ b/Documentation/devicetree/bindings/hwmon/ltq-cputemp.txt
@@ -0,0 +1,10 @@
+Lantiq cpu temperatur sensor
+
+Requires node properties:
+- compatible value :
+ "lantiq,cputemp"
+
+Example:
+ cputemp@0 {
+ compatible = "lantiq,cputemp";
+ };
diff --git a/Documentation/devicetree/bindings/iio/adc/at91-sama5d2_adc.txt b/Documentation/devicetree/bindings/iio/adc/at91-sama5d2_adc.txt
index 3223684a643b..552e7a83951d 100644
--- a/Documentation/devicetree/bindings/iio/adc/at91-sama5d2_adc.txt
+++ b/Documentation/devicetree/bindings/iio/adc/at91-sama5d2_adc.txt
@@ -11,6 +11,11 @@ Required properties:
- atmel,min-sample-rate-hz: Minimum sampling rate, it depends on SoC.
- atmel,max-sample-rate-hz: Maximum sampling rate, it depends on SoC.
- atmel,startup-time-ms: Startup time expressed in ms, it depends on SoC.
+ - atmel,trigger-edge-type: One of possible edge types for the ADTRG hardware
+ trigger pin. When the specific edge type is detected, the conversion will
+ start. Possible values are rising, falling, or both.
+ This property uses the IRQ edge types values: IRQ_TYPE_EDGE_RISING ,
+ IRQ_TYPE_EDGE_FALLING or IRQ_TYPE_EDGE_BOTH
Example:
@@ -25,4 +30,5 @@ adc: adc@fc030000 {
atmel,startup-time-ms = <4>;
vddana-supply = <&vdd_3v3_lp_reg>;
vref-supply = <&vdd_3v3_lp_reg>;
+ atmel,trigger-edge-type = <IRQ_TYPE_EDGE_BOTH>;
}
diff --git a/Documentation/devicetree/bindings/iio/adc/mt6577_auxadc.txt b/Documentation/devicetree/bindings/iio/adc/mt6577_auxadc.txt
index 68c45cbbe3d9..64dc4843c180 100644
--- a/Documentation/devicetree/bindings/iio/adc/mt6577_auxadc.txt
+++ b/Documentation/devicetree/bindings/iio/adc/mt6577_auxadc.txt
@@ -12,6 +12,7 @@ for the Thermal Controller which holds a phandle to the AUXADC.
Required properties:
- compatible: Should be one of:
- "mediatek,mt2701-auxadc": For MT2701 family of SoCs
+ - "mediatek,mt7622-auxadc": For MT7622 family of SoCs
- "mediatek,mt8173-auxadc": For MT8173 family of SoCs
- reg: Address range of the AUXADC unit.
- clocks: Should contain a clock specifier for each entry in clock-names
diff --git a/Documentation/devicetree/bindings/iio/adc/rockchip-saradc.txt b/Documentation/devicetree/bindings/iio/adc/rockchip-saradc.txt
index e0a9b9d6d6fd..c2c50b59873d 100644
--- a/Documentation/devicetree/bindings/iio/adc/rockchip-saradc.txt
+++ b/Documentation/devicetree/bindings/iio/adc/rockchip-saradc.txt
@@ -6,6 +6,7 @@ Required properties:
- "rockchip,rk3066-tsadc": for rk3036
- "rockchip,rk3328-saradc", "rockchip,rk3399-saradc": for rk3328
- "rockchip,rk3399-saradc": for rk3399
+ - "rockchip,rv1108-saradc", "rockchip,rk3399-saradc": for rv1108
- reg: physical base address of the controller and length of memory mapped
region.
diff --git a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt
index 8310073f14e1..48bfcaa3ffcd 100644
--- a/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt
+++ b/Documentation/devicetree/bindings/iio/adc/st,stm32-adc.txt
@@ -74,6 +74,11 @@ Optional properties:
* can be 6, 8, 10 or 12 on stm32f4
* can be 8, 10, 12, 14 or 16 on stm32h7
Default is maximum resolution if unset.
+- st,min-sample-time-nsecs: Minimum sampling time in nanoseconds.
+ Depending on hardware (board) e.g. high/low analog input source impedance,
+ fine tune of ADC sampling time may be recommended.
+ This can be either one value or an array that matches 'st,adc-channels' list,
+ to set sample time resp. for all channels, or independently for each channel.
Example:
adc: adc@40012000 {
diff --git a/Documentation/devicetree/bindings/iio/dac/st,stm32-dac.txt b/Documentation/devicetree/bindings/iio/dac/st,stm32-dac.txt
index bcee71f808d0..bf2925c671c6 100644
--- a/Documentation/devicetree/bindings/iio/dac/st,stm32-dac.txt
+++ b/Documentation/devicetree/bindings/iio/dac/st,stm32-dac.txt
@@ -10,7 +10,9 @@ current.
Contents of a stm32 dac root node:
-----------------------------------
Required properties:
-- compatible: Must be "st,stm32h7-dac-core".
+- compatible: Should be one of:
+ "st,stm32f4-dac-core"
+ "st,stm32h7-dac-core"
- reg: Offset and length of the device's register set.
- clocks: Must contain an entry for pclk (which feeds the peripheral bus
interface)
diff --git a/Documentation/devicetree/bindings/iio/humidity/hdc100x.txt b/Documentation/devicetree/bindings/iio/humidity/hdc100x.txt
new file mode 100644
index 000000000000..c52333bdfd19
--- /dev/null
+++ b/Documentation/devicetree/bindings/iio/humidity/hdc100x.txt
@@ -0,0 +1,17 @@
+* HDC100x temperature + humidity sensors
+
+Required properties:
+ - compatible: Should contain one of the following:
+ ti,hdc1000
+ ti,hdc1008
+ ti,hdc1010
+ ti,hdc1050
+ ti,hdc1080
+ - reg: i2c address of the sensor
+
+Example:
+
+hdc100x@40 {
+ compatible = "ti,hdc1000";
+ reg = <0x40>;
+};
diff --git a/Documentation/devicetree/bindings/iio/humidity/hts221.txt b/Documentation/devicetree/bindings/iio/humidity/hts221.txt
index b20ab9c12080..10adeb0d703d 100644
--- a/Documentation/devicetree/bindings/iio/humidity/hts221.txt
+++ b/Documentation/devicetree/bindings/iio/humidity/hts221.txt
@@ -5,9 +5,18 @@ Required properties:
- reg: i2c address of the sensor / spi cs line
Optional properties:
+- drive-open-drain: the interrupt/data ready line will be configured
+ as open drain, which is useful if several sensors share the same
+ interrupt line. This is a boolean property.
+ If the requested interrupt is configured as IRQ_TYPE_LEVEL_HIGH or
+ IRQ_TYPE_EDGE_RISING a pull-down resistor is needed to drive the line
+ when it is not active, whereas a pull-up one is needed when interrupt
+ line is configured as IRQ_TYPE_LEVEL_LOW or IRQ_TYPE_EDGE_FALLING.
+ Refer to pinctrl/pinctrl-bindings.txt for the property description.
- interrupt-parent: should be the phandle for the interrupt controller
- interrupts: interrupt mapping for IRQ. It should be configured with
- flags IRQ_TYPE_LEVEL_HIGH or IRQ_TYPE_EDGE_RISING.
+ flags IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_EDGE_RISING, IRQ_TYPE_LEVEL_LOW or
+ IRQ_TYPE_EDGE_FALLING.
Refer to interrupt-controller/interrupts.txt for generic interrupt
client node bindings.
diff --git a/Documentation/devicetree/bindings/iio/humidity/htu21.txt b/Documentation/devicetree/bindings/iio/humidity/htu21.txt
new file mode 100644
index 000000000000..97d79636f7ae
--- /dev/null
+++ b/Documentation/devicetree/bindings/iio/humidity/htu21.txt
@@ -0,0 +1,13 @@
+*HTU21 - Measurement-Specialties htu21 temperature & humidity sensor and humidity part of MS8607 sensor
+
+Required properties:
+
+ - compatible: should be "meas,htu21" or "meas,ms8607-humidity"
+ - reg: I2C address of the sensor
+
+Example:
+
+htu21@40 {
+ compatible = "meas,htu21";
+ reg = <0x40>;
+};
diff --git a/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt b/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt
index 6f28ff55f3ec..1ff1af799c76 100644
--- a/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt
+++ b/Documentation/devicetree/bindings/iio/imu/st_lsm6dsx.txt
@@ -11,6 +11,14 @@ Required properties:
Optional properties:
- st,drdy-int-pin: the pin on the package that will be used to signal
"data ready" (valid values: 1 or 2).
+- drive-open-drain: the interrupt/data ready line will be configured
+ as open drain, which is useful if several sensors share the same
+ interrupt line. This is a boolean property.
+ (This binding is taken from pinctrl/pinctrl-bindings.txt)
+ If the requested interrupt is configured as IRQ_TYPE_LEVEL_HIGH or
+ IRQ_TYPE_EDGE_RISING a pull-down resistor is needed to drive the line
+ when it is not active, whereas a pull-up one is needed when interrupt
+ line is configured as IRQ_TYPE_LEVEL_LOW or IRQ_TYPE_EDGE_FALLING.
- interrupt-parent: should be the phandle for the interrupt controller
- interrupts: interrupt mapping for IRQ. It should be configured with
flags IRQ_TYPE_LEVEL_HIGH, IRQ_TYPE_EDGE_RISING, IRQ_TYPE_LEVEL_LOW or
diff --git a/Documentation/devicetree/bindings/iio/pressure/ms5637.txt b/Documentation/devicetree/bindings/iio/pressure/ms5637.txt
new file mode 100644
index 000000000000..1f43ffa068ac
--- /dev/null
+++ b/Documentation/devicetree/bindings/iio/pressure/ms5637.txt
@@ -0,0 +1,17 @@
+* MS5637 - Measurement-Specialties MS5637, MS5805, MS5837 and MS8607 pressure & temperature sensor
+
+Required properties:
+
+ -compatible: should be one of the following
+ meas,ms5637
+ meas,ms5805
+ meas,ms5837
+ meas,ms8607-temppressure
+ -reg: I2C address of the sensor
+
+Example:
+
+ms5637@76 {
+ compatible = "meas,ms5637";
+ reg = <0x76>;
+};
diff --git a/Documentation/devicetree/bindings/iio/st-sensors.txt b/Documentation/devicetree/bindings/iio/st-sensors.txt
index eaa8fbba34e2..9ec6f5ce54fc 100644
--- a/Documentation/devicetree/bindings/iio/st-sensors.txt
+++ b/Documentation/devicetree/bindings/iio/st-sensors.txt
@@ -45,6 +45,7 @@ Accelerometers:
- st,lis2dh12-accel
- st,h3lis331dl-accel
- st,lng2dm-accel
+- st,lis3l02dq
Gyroscopes:
- st,l3g4200d-gyro
@@ -52,6 +53,7 @@ Gyroscopes:
- st,lsm330dl-gyro
- st,lsm330dlc-gyro
- st,l3gd20-gyro
+- st,l3gd20h-gyro
- st,l3g4is-gyro
- st,lsm330-gyro
- st,lsm9ds0-gyro
@@ -62,6 +64,7 @@ Magnetometers:
- st,lsm303dlhc-magn
- st,lsm303dlm-magn
- st,lis3mdl-magn
+- st,lis2mdl
Pressure sensors:
- st,lps001wp-press
diff --git a/Documentation/devicetree/bindings/iio/temperature/tsys01.txt b/Documentation/devicetree/bindings/iio/temperature/tsys01.txt
new file mode 100644
index 000000000000..0d5cc5595d0c
--- /dev/null
+++ b/Documentation/devicetree/bindings/iio/temperature/tsys01.txt
@@ -0,0 +1,19 @@
+* TSYS01 - Measurement Specialties temperature sensor
+
+Required properties:
+
+ - compatible: should be "meas,tsys01"
+ - reg: I2C address of the sensor (changeable via CSB pin)
+
+ ------------------------
+ | CSB | Device Address |
+ ------------------------
+ 1 0x76
+ 0 0x77
+
+Example:
+
+tsys01@76 {
+ compatible = "meas,tsys01";
+ reg = <0x76>;
+};
diff --git a/Documentation/devicetree/bindings/iio/timer/stm32-timer-trigger.txt b/Documentation/devicetree/bindings/iio/timer/stm32-timer-trigger.txt
index 55a653d15303..b8e8c769d434 100644
--- a/Documentation/devicetree/bindings/iio/timer/stm32-timer-trigger.txt
+++ b/Documentation/devicetree/bindings/iio/timer/stm32-timer-trigger.txt
@@ -4,7 +4,9 @@ Must be a sub-node of an STM32 Timers device tree node.
See ../mfd/stm32-timers.txt for details about the parent node.
Required parameters:
-- compatible: Must be "st,stm32-timer-trigger".
+- compatible: Must be one of:
+ "st,stm32-timer-trigger"
+ "st,stm32h7-timer-trigger"
- reg: Identify trigger hardware block.
Example:
@@ -14,7 +16,7 @@ Example:
compatible = "st,stm32-timers";
reg = <0x40010000 0x400>;
clocks = <&rcc 0 160>;
- clock-names = "clk_int";
+ clock-names = "int";
timer@0 {
compatible = "st,stm32-timer-trigger";
diff --git a/Documentation/devicetree/bindings/interrupt-controller/fsl,ls-scfg-msi.txt b/Documentation/devicetree/bindings/interrupt-controller/fsl,ls-scfg-msi.txt
index 9e389493203f..49ccabbfa6f3 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/fsl,ls-scfg-msi.txt
+++ b/Documentation/devicetree/bindings/interrupt-controller/fsl,ls-scfg-msi.txt
@@ -4,8 +4,10 @@ Required properties:
- compatible: should be "fsl,<soc-name>-msi" to identify
Layerscape PCIe MSI controller block such as:
- "fsl,1s1021a-msi"
- "fsl,1s1043a-msi"
+ "fsl,ls1021a-msi"
+ "fsl,ls1043a-msi"
+ "fsl,ls1046a-msi"
+ "fsl,ls1043a-v1.1-msi"
- msi-controller: indicates that this is a PCIe MSI controller node
- reg: physical base address of the controller and length of memory mapped.
- interrupts: an interrupt to the parent interrupt controller.
@@ -23,7 +25,7 @@ MSI controller node
Examples:
msi1: msi-controller@1571000 {
- compatible = "fsl,1s1043a-msi";
+ compatible = "fsl,ls1043a-msi";
reg = <0x0 0x1571000 0x0 0x8>,
msi-controller;
interrupts = <0 116 0x4>;
diff --git a/Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.txt b/Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.txt
new file mode 100644
index 000000000000..48e71d3ac2ad
--- /dev/null
+++ b/Documentation/devicetree/bindings/interrupt-controller/socionext,uniphier-aidet.txt
@@ -0,0 +1,32 @@
+UniPhier AIDET
+
+UniPhier AIDET (ARM Interrupt Detector) is an add-on block for ARM GIC (Generic
+Interrupt Controller). GIC itself can handle only high level and rising edge
+interrupts. The AIDET provides logic inverter to support low level and falling
+edge interrupts.
+
+Required properties:
+- compatible: Should be one of the following:
+ "socionext,uniphier-ld4-aidet" - for LD4 SoC
+ "socionext,uniphier-pro4-aidet" - for Pro4 SoC
+ "socionext,uniphier-sld8-aidet" - for sLD8 SoC
+ "socionext,uniphier-pro5-aidet" - for Pro5 SoC
+ "socionext,uniphier-pxs2-aidet" - for PXs2/LD6b SoC
+ "socionext,uniphier-ld11-aidet" - for LD11 SoC
+ "socionext,uniphier-ld20-aidet" - for LD20 SoC
+ "socionext,uniphier-pxs3-aidet" - for PXs3 SoC
+- reg: Specifies offset and length of the register set for the device.
+- interrupt-controller: Identifies the node as an interrupt controller
+- #interrupt-cells : Specifies the number of cells needed to encode an interrupt
+ source. The value should be 2. The first cell defines the interrupt number
+ (corresponds to the SPI interrupt number of GIC). The second cell specifies
+ the trigger type as defined in interrupts.txt in this directory.
+
+Example:
+
+ aidet: aidet@5fc20000 {
+ compatible = "socionext,uniphier-pro4-aidet";
+ reg = <0x5fc20000 0x200>;
+ interrupt-controller;
+ #interrupt-cells = <2>;
+ };
diff --git a/Documentation/devicetree/bindings/mfd/wm831x.txt b/Documentation/devicetree/bindings/mfd/wm831x.txt
index 9f8b7430673c..505709403d3f 100644
--- a/Documentation/devicetree/bindings/mfd/wm831x.txt
+++ b/Documentation/devicetree/bindings/mfd/wm831x.txt
@@ -31,6 +31,7 @@ Required properties:
../interrupt-controller/interrupts.txt
Optional sub-nodes:
+ - phys : Contains a phandle to the USB PHY.
- regulators : Contains sub-nodes for each of the regulators supplied by
the device. The regulators are bound using their names listed below:
diff --git a/Documentation/devicetree/bindings/net/anarion-gmac.txt b/Documentation/devicetree/bindings/net/anarion-gmac.txt
new file mode 100644
index 000000000000..fe678965ae69
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/anarion-gmac.txt
@@ -0,0 +1,25 @@
+* Adaptrum Anarion ethernet controller
+
+This device is a platform glue layer for stmmac.
+Please see stmmac.txt for the other unchanged properties.
+
+Required properties:
+ - compatible: Should be "adaptrum,anarion-gmac", "snps,dwmac"
+ - phy-mode: Should be "rgmii". Other modes are not currently supported.
+
+
+Examples:
+
+ gmac1: ethernet@f2014000 {
+ compatible = "adaptrum,anarion-gmac", "snps,dwmac";
+ reg = <0xf2014000 0x4000>, <0xf2018100 8>;
+
+ interrupt-parent = <&core_intc>;
+ interrupts = <21>;
+ interrupt-names = "macirq";
+
+ clocks = <&core_clk>;
+ clock-names = "stmmaceth";
+
+ phy-mode = "rgmii";
+ };
diff --git a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
new file mode 100644
index 000000000000..4194ff7e6ee6
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
@@ -0,0 +1,35 @@
+Broadcom Bluetooth Chips
+---------------------
+
+This documents the binding structure and common properties for serial
+attached Broadcom devices.
+
+Serial attached Broadcom devices shall be a child node of the host UART
+device the slave device is attached to.
+
+Required properties:
+
+ - compatible: should contain one of the following:
+ * "brcm,bcm43438-bt"
+
+Optional properties:
+
+ - max-speed: see Documentation/devicetree/bindings/serial/slave-device.txt
+ - shutdown-gpios: GPIO specifier, used to enable the BT module
+ - device-wakeup-gpios: GPIO specifier, used to wakeup the controller
+ - host-wakeup-gpios: GPIO specifier, used to wakeup the host processor
+ - clocks: clock specifier if external clock provided to the controller
+ - clock-names: should be "extclk"
+
+
+Example:
+
+&uart2 {
+ pinctrl-names = "default";
+ pinctrl-0 = <&uart2_pins>;
+
+ bluetooth {
+ compatible = "brcm,bcm43438-bt";
+ max-speed = <921600>;
+ };
+};
diff --git a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt b/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
deleted file mode 100644
index 725f3b187886..000000000000
--- a/Documentation/devicetree/bindings/net/dwmac-sun8i.txt
+++ /dev/null
@@ -1,84 +0,0 @@
-* Allwinner sun8i GMAC ethernet controller
-
-This device is a platform glue layer for stmmac.
-Please see stmmac.txt for the other unchanged properties.
-
-Required properties:
-- compatible: should be one of the following string:
- "allwinner,sun8i-a83t-emac"
- "allwinner,sun8i-h3-emac"
- "allwinner,sun8i-v3s-emac"
- "allwinner,sun50i-a64-emac"
-- reg: address and length of the register for the device.
-- interrupts: interrupt for the device
-- interrupt-names: should be "macirq"
-- clocks: A phandle to the reference clock for this device
-- clock-names: should be "stmmaceth"
-- resets: A phandle to the reset control for this device
-- reset-names: should be "stmmaceth"
-- phy-mode: See ethernet.txt
-- phy-handle: See ethernet.txt
-- #address-cells: shall be 1
-- #size-cells: shall be 0
-- syscon: A phandle to the syscon of the SoC with one of the following
- compatible string:
- - allwinner,sun8i-h3-system-controller
- - allwinner,sun8i-v3s-system-controller
- - allwinner,sun50i-a64-system-controller
- - allwinner,sun8i-a83t-system-controller
-
-Optional properties:
-- allwinner,tx-delay-ps: TX clock delay chain value in ps. Range value is 0-700. Default is 0)
-- allwinner,rx-delay-ps: RX clock delay chain value in ps. Range value is 0-3100. Default is 0)
-Both delay properties need to be a multiple of 100. They control the delay for
-external PHY.
-
-Optional properties for the following compatibles:
- - "allwinner,sun8i-h3-emac",
- - "allwinner,sun8i-v3s-emac":
-- allwinner,leds-active-low: EPHY LEDs are active low
-
-Required child node of emac:
-- mdio bus node: should be named mdio
-
-Required properties of the mdio node:
-- #address-cells: shall be 1
-- #size-cells: shall be 0
-
-The device node referenced by "phy" or "phy-handle" should be a child node
-of the mdio node. See phy.txt for the generic PHY bindings.
-
-Required properties of the phy node with the following compatibles:
- - "allwinner,sun8i-h3-emac",
- - "allwinner,sun8i-v3s-emac":
-- clocks: a phandle to the reference clock for the EPHY
-- resets: a phandle to the reset control for the EPHY
-
-Example:
-
-emac: ethernet@1c0b000 {
- compatible = "allwinner,sun8i-h3-emac";
- syscon = <&syscon>;
- reg = <0x01c0b000 0x104>;
- interrupts = <GIC_SPI 82 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-names = "macirq";
- resets = <&ccu RST_BUS_EMAC>;
- reset-names = "stmmaceth";
- clocks = <&ccu CLK_BUS_EMAC>;
- clock-names = "stmmaceth";
- #address-cells = <1>;
- #size-cells = <0>;
-
- phy-handle = <&int_mii_phy>;
- phy-mode = "mii";
- allwinner,leds-active-low;
- mdio: mdio {
- #address-cells = <1>;
- #size-cells = <0>;
- int_mii_phy: ethernet-phy@1 {
- reg = <1>;
- clocks = <&ccu CLK_BUS_EPHY>;
- resets = <&ccu RST_BUS_EPHY>;
- };
- };
-};
diff --git a/Documentation/devicetree/bindings/net/marvell-pp2.txt b/Documentation/devicetree/bindings/net/marvell-pp2.txt
index 6b4956beff8c..c78f3187dfea 100644
--- a/Documentation/devicetree/bindings/net/marvell-pp2.txt
+++ b/Documentation/devicetree/bindings/net/marvell-pp2.txt
@@ -41,6 +41,11 @@ Optional properties (port):
- marvell,loopback: port is loopback mode
- phy: a phandle to a phy node defining the PHY address (as the reg
property, a single integer).
+- interrupt-names: if more than a single interrupt for rx is given, must
+ be the name associated to the interrupts listed. Valid
+ names are: "tx-cpu0", "tx-cpu1", "tx-cpu2", "tx-cpu3",
+ "rx-shared", "link".
+- marvell,system-controller: a phandle to the system controller.
Example for marvell,armada-375-pp2:
@@ -80,19 +85,37 @@ cpm_ethernet: ethernet@0 {
clock-names = "pp_clk", "gop_clk", "gp_clk";
eth0: eth0 {
- interrupts = <GIC_SPI 37 IRQ_TYPE_LEVEL_HIGH>;
+ interrupts = <ICU_GRP_NSR 39 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 43 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 47 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 51 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 55 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "tx-cpu0", "tx-cpu1", "tx-cpu2",
+ "tx-cpu3", "rx-shared";
port-id = <0>;
gop-port-id = <0>;
};
eth1: eth1 {
- interrupts = <GIC_SPI 38 IRQ_TYPE_LEVEL_HIGH>;
+ interrupts = <ICU_GRP_NSR 40 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 44 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 48 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 52 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 56 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "tx-cpu0", "tx-cpu1", "tx-cpu2",
+ "tx-cpu3", "rx-shared";
port-id = <1>;
gop-port-id = <2>;
};
eth2: eth2 {
- interrupts = <GIC_SPI 39 IRQ_TYPE_LEVEL_HIGH>;
+ interrupts = <ICU_GRP_NSR 41 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 45 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 49 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 53 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 57 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "tx-cpu0", "tx-cpu1", "tx-cpu2",
+ "tx-cpu3", "rx-shared";
port-id = <2>;
gop-port-id = <3>;
};
diff --git a/Documentation/devicetree/bindings/net/mediatek-net.txt b/Documentation/devicetree/bindings/net/mediatek-net.txt
index c7194e87d5f4..1d1168b805cc 100644
--- a/Documentation/devicetree/bindings/net/mediatek-net.txt
+++ b/Documentation/devicetree/bindings/net/mediatek-net.txt
@@ -7,24 +7,30 @@ have dual GMAC each represented by a child node..
* Ethernet controller node
Required properties:
-- compatible: Should be "mediatek,mt2701-eth"
+- compatible: Should be
+ "mediatek,mt2701-eth": for MT2701 SoC
+ "mediatek,mt7623-eth", "mediatek,mt2701-eth": for MT7623 SoC
+ "mediatek,mt7622-eth": for MT7622 SoC
- reg: Address and length of the register set for the device
- interrupts: Should contain the three frame engines interrupts in numeric
order. These are fe_int0, fe_int1 and fe_int2.
- clocks: the clock used by the core
- clock-names: the names of the clock listed in the clocks property. These are
- "ethif", "esw", "gp2", "gp1"
+ "ethif", "esw", "gp2", "gp1" : For MT2701 and MT7623 SoC
+ "ethif", "esw", "gp0", "gp1", "gp2", "sgmii_tx250m", "sgmii_rx250m",
+ "sgmii_cdr_ref", "sgmii_cdr_fb", "sgmii_ck", "eth2pll" : For MT7622 SoC
- power-domains: phandle to the power domain that the ethernet is part of
- resets: Should contain a phandle to the ethsys reset signal
- reset-names: Should contain the reset signal name "eth"
- mediatek,ethsys: phandle to the syscon node that handles the port setup
+- mediatek,sgmiisys: phandle to the syscon node that handles the SGMII setup
+ which is required for those SoCs equipped with SGMII such as MT7622 SoC.
- mediatek,pctl: phandle to the syscon node that handles the ports slew rate
and driver current
Optional properties:
- interrupt-parent: Should be the phandle for the interrupt controller
that services interrupts for this device
-
* Ethernet MAC node
Required properties:
diff --git a/Documentation/devicetree/bindings/net/phy.txt b/Documentation/devicetree/bindings/net/phy.txt
index b55857696fc3..d3c24d5ffa9a 100644
--- a/Documentation/devicetree/bindings/net/phy.txt
+++ b/Documentation/devicetree/bindings/net/phy.txt
@@ -52,6 +52,11 @@ Optional Properties:
Mark the corresponding energy efficient ethernet mode as broken and
request the ethernet to stop advertising it.
+- phy-is-integrated: If set, indicates that the PHY is integrated into the same
+ physical package as the Ethernet MAC. If needed, muxers should be configured
+ to ensure the integrated PHY is used. The absence of this property indicates
+ the muxers should be configured so that the external PHY is used.
+
Example:
ethernet-phy@0 {
diff --git a/Documentation/devicetree/bindings/net/renesas,ravb.txt b/Documentation/devicetree/bindings/net/renesas,ravb.txt
index b519503be51a..16723535e1aa 100644
--- a/Documentation/devicetree/bindings/net/renesas,ravb.txt
+++ b/Documentation/devicetree/bindings/net/renesas,ravb.txt
@@ -4,19 +4,25 @@ This file provides information on what the device node for the Ethernet AVB
interface contains.
Required properties:
-- compatible: "renesas,etheravb-r8a7790" if the device is a part of R8A7790 SoC.
- "renesas,etheravb-r8a7791" if the device is a part of R8A7791 SoC.
- "renesas,etheravb-r8a7792" if the device is a part of R8A7792 SoC.
- "renesas,etheravb-r8a7793" if the device is a part of R8A7793 SoC.
- "renesas,etheravb-r8a7794" if the device is a part of R8A7794 SoC.
- "renesas,etheravb-r8a7795" if the device is a part of R8A7795 SoC.
- "renesas,etheravb-r8a7796" if the device is a part of R8A7796 SoC.
- "renesas,etheravb-rcar-gen2" for generic R-Car Gen 2 compatible interface.
- "renesas,etheravb-rcar-gen3" for generic R-Car Gen 3 compatible interface.
+- compatible: Must contain one or more of the following:
+ - "renesas,etheravb-r8a7743" for the R8A7743 SoC.
+ - "renesas,etheravb-r8a7745" for the R8A7745 SoC.
+ - "renesas,etheravb-r8a7790" for the R8A7790 SoC.
+ - "renesas,etheravb-r8a7791" for the R8A7791 SoC.
+ - "renesas,etheravb-r8a7792" for the R8A7792 SoC.
+ - "renesas,etheravb-r8a7793" for the R8A7793 SoC.
+ - "renesas,etheravb-r8a7794" for the R8A7794 SoC.
+ - "renesas,etheravb-rcar-gen2" as a fallback for the above
+ R-Car Gen2 and RZ/G1 devices.
- When compatible with the generic version, nodes must list the
- SoC-specific version corresponding to the platform first
- followed by the generic version.
+ - "renesas,etheravb-r8a7795" for the R8A7795 SoC.
+ - "renesas,etheravb-r8a7796" for the R8A7796 SoC.
+ - "renesas,etheravb-rcar-gen3" as a fallback for the above
+ R-Car Gen3 devices.
+
+ When compatible with the generic version, nodes must list the
+ SoC-specific version corresponding to the platform first followed by
+ the generic version.
- reg: offset and length of (1) the register block and (2) the stream buffer.
- interrupts: A list of interrupt-specifiers, one for each entry in
diff --git a/Documentation/devicetree/bindings/net/rockchip-dwmac.txt b/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
index 8f427550720a..c1325387632c 100644
--- a/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
+++ b/Documentation/devicetree/bindings/net/rockchip-dwmac.txt
@@ -10,6 +10,7 @@ Required properties:
"rockchip,rk3366-gmac": found on RK3366 SoCs
"rockchip,rk3368-gmac": found on RK3368 SoCs
"rockchip,rk3399-gmac": found on RK3399 SoCs
+ "rockchip,rv1108-gmac": found on RV1108 SoCs
- reg: addresses and length of the register sets for the device.
- interrupts: Should contain the GMAC interrupts.
- interrupt-names: Should contain the interrupt names "macirq".
diff --git a/Documentation/devicetree/bindings/net/xilinx_axienet.txt b/Documentation/devicetree/bindings/net/xilinx_axienet.txt
new file mode 100644
index 000000000000..38f9ec076743
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/xilinx_axienet.txt
@@ -0,0 +1,55 @@
+XILINX AXI ETHERNET Device Tree Bindings
+--------------------------------------------------------
+
+Also called AXI 1G/2.5G Ethernet Subsystem, the xilinx axi ethernet IP core
+provides connectivity to an external ethernet PHY supporting different
+interfaces: MII, GMII, RGMII, SGMII, 1000BaseX. It also includes two
+segments of memory for buffering TX and RX, as well as the capability of
+offloading TX/RX checksum calculation off the processor.
+
+Management configuration is done through the AXI interface, while payload is
+sent and received through means of an AXI DMA controller. This driver
+includes the DMA driver code, so this driver is incompatible with AXI DMA
+driver.
+
+For more details about mdio please refer phy.txt file in the same directory.
+
+Required properties:
+- compatible : Must be one of "xlnx,axi-ethernet-1.00.a",
+ "xlnx,axi-ethernet-1.01.a", "xlnx,axi-ethernet-2.01.a"
+- reg : Address and length of the IO space.
+- interrupts : Should be a list of two interrupt, TX and RX.
+- phy-handle : Should point to the external phy device.
+ See ethernet.txt file in the same directory.
+- xlnx,rxmem : Set to allocated memory buffer for Rx/Tx in the hardware
+
+Optional properties:
+- phy-mode : See ethernet.txt
+- xlnx,phy-type : Deprecated, do not use, but still accepted in preference
+ to phy-mode.
+- xlnx,txcsum : 0 or empty for disabling TX checksum offload,
+ 1 to enable partial TX checksum offload,
+ 2 to enable full TX checksum offload
+- xlnx,rxcsum : Same values as xlnx,txcsum but for RX checksum offload
+
+Example:
+ axi_ethernet_eth: ethernet@40c00000 {
+ compatible = "xlnx,axi-ethernet-1.00.a";
+ device_type = "network";
+ interrupt-parent = <&microblaze_0_axi_intc>;
+ interrupts = <2 0>;
+ phy-mode = "mii";
+ reg = <0x40c00000 0x40000>;
+ xlnx,rxcsum = <0x2>;
+ xlnx,rxmem = <0x800>;
+ xlnx,txcsum = <0x2>;
+ phy-handle = <&phy0>;
+ axi_ethernetlite_0_mdio: mdio {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ phy0: phy@0 {
+ device_type = "ethernet-phy";
+ reg = <1>;
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt b/Documentation/devicetree/bindings/phy/phy-mtk-tphy.txt
index 0acc5a99fb79..faf18084a33a 100644
--- a/Documentation/devicetree/bindings/phy/phy-mt65xx-usb.txt
+++ b/Documentation/devicetree/bindings/phy/phy-mtk-tphy.txt
@@ -1,13 +1,18 @@
-mt65xx USB3.0 PHY binding
+MediaTek T-PHY binding
--------------------------
-This binding describes a usb3.0 phy for mt65xx platforms of Medaitek SoC.
+T-phy controller supports physical layer functionality for a number of
+controllers on MediaTek SoCs, such as, USB2.0, USB3.0, PCIe, and SATA.
Required properties (controller (parent) node):
- compatible : should be one of
- "mediatek,mt2701-u3phy"
- "mediatek,mt2712-u3phy"
- "mediatek,mt8173-u3phy"
+ "mediatek,generic-tphy-v1"
+ "mediatek,generic-tphy-v2"
+ "mediatek,mt2701-u3phy" (deprecated)
+ "mediatek,mt2712-u3phy" (deprecated)
+ "mediatek,mt8173-u3phy";
+ make use of "mediatek,generic-tphy-v1" on mt2701 instead and
+ "mediatek,generic-tphy-v2" on mt2712 instead.
- clocks : (deprecated, use port's clocks instead) a list of phandle +
clock-specifier pairs, one for each entry in clock-names
- clock-names : (deprecated, use port's one instead) must contain
@@ -35,6 +40,8 @@ Required properties (port (child) node):
cell after port phandle is phy type from:
- PHY_TYPE_USB2
- PHY_TYPE_USB3
+ - PHY_TYPE_PCIE
+ - PHY_TYPE_SATA
Example:
diff --git a/Documentation/devicetree/bindings/phy/phy-mvebu-comphy.txt b/Documentation/devicetree/bindings/phy/phy-mvebu-comphy.txt
new file mode 100644
index 000000000000..bfcf80341657
--- /dev/null
+++ b/Documentation/devicetree/bindings/phy/phy-mvebu-comphy.txt
@@ -0,0 +1,43 @@
+mvebu comphy driver
+-------------------
+
+A comphy controller can be found on Marvell Armada 7k/8k on the CP110. It
+provides a number of shared PHYs used by various interfaces (network, sata,
+usb, PCIe...).
+
+Required properties:
+
+- compatible: should be "marvell,comphy-cp110"
+- reg: should contain the comphy register location and length.
+- marvell,system-controller: should contain a phandle to the
+ system controller node.
+- #address-cells: should be 1.
+- #size-cells: should be 0.
+
+A sub-node is required for each comphy lane provided by the comphy.
+
+Required properties (child nodes):
+
+- reg: comphy lane number.
+- #phy-cells : from the generic phy bindings, must be 1. Defines the
+ input port to use for a given comphy lane.
+
+Example:
+
+ cpm_comphy: phy@120000 {
+ compatible = "marvell,comphy-cp110";
+ reg = <0x120000 0x6000>;
+ marvell,system-controller = <&cpm_syscon0>;
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ cpm_comphy0: phy@0 {
+ reg = <0>;
+ #phy-cells = <1>;
+ };
+
+ cpm_comphy1: phy@1 {
+ reg = <1>;
+ #phy-cells = <1>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt b/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt
index 84d59b0db8df..a67ef2a3874f 100644
--- a/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt
+++ b/Documentation/devicetree/bindings/phy/phy-rockchip-inno-usb2.txt
@@ -6,6 +6,7 @@ Required properties (phy (parent) node):
* "rockchip,rk3328-usb2phy"
* "rockchip,rk3366-usb2phy"
* "rockchip,rk3399-usb2phy"
+ * "rockchip,rv1108-usb2phy"
- reg : the address offset of grf for usb-phy configuration.
- #clock-cells : should be 0.
- clock-output-names : specify the 480m output clock name.
@@ -18,6 +19,10 @@ Optional properties:
usb-phy output 480m and xin24m.
Refer to clk/clock-bindings.txt for generic clock
consumer properties.
+ - rockchip,usbgrf : phandle to the syscon managing the "usb general
+ register files". When set driver will request its
+ phandle as one companion-grf for some special SoCs
+ (e.g RV1108).
Required nodes : a sub-node is required for each port the phy provides.
The sub-node name is used to identify host or otg port,
@@ -28,10 +33,14 @@ Required nodes : a sub-node is required for each port the phy provides.
Required properties (port (child) node):
- #phy-cells : must be 0. See ./phy-bindings.txt for details.
- interrupts : specify an interrupt for each entry in interrupt-names.
- - interrupt-names : a list which shall be the following entries:
+ - interrupt-names : a list which should be one of the following cases:
+ Regular case:
* "otg-id" : for the otg id interrupt.
* "otg-bvalid" : for the otg vbus interrupt.
* "linestate" : for the host/otg linestate interrupt.
+ Some SoCs use one interrupt with the above muxed together, so for these
+ * "otg-mux" : otg-port interrupt, which mux otg-id/otg-bvalid/linestate
+ to one.
Optional properties:
- phy-supply : phandle to a regulator that provides power to VBUS.
diff --git a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt
index e11c563a65ec..b6a9f2b92bab 100644
--- a/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt
+++ b/Documentation/devicetree/bindings/phy/qcom-qmp-phy.txt
@@ -6,6 +6,7 @@ controllers on Qualcomm chipsets, such as, PCIe, UFS, and USB.
Required properties:
- compatible: compatible list, contains:
+ "qcom,ipq8074-qmp-pcie-phy" for PCIe phy on IPQ8074
"qcom,msm8996-qmp-pcie-phy" for 14nm PCIe phy on msm8996,
"qcom,msm8996-qmp-usb3-phy" for 14nm USB3 phy on msm8996.
@@ -38,6 +39,8 @@ Required properties:
"phy", "common", "cfg".
For "qcom,msm8996-qmp-usb3-phy" must contain
"phy", "common".
+ For "qcom,ipq8074-qmp-pcie-phy" must contain:
+ "phy", "common".
- vdda-phy-supply: Phandle to a regulator supply to PHY core block.
- vdda-pll-supply: Phandle to 1.8V regulator supply to PHY refclk pll block.
@@ -60,6 +63,13 @@ Required properties for child node:
one for each entry in clock-names.
- clock-names: Must contain following for pcie and usb qmp phys:
"pipe<lane-number>" for pipe clock specific to each lane.
+ - clock-output-names: Name of the PHY clock that will be the parent for
+ the above pipe clock.
+
+ For "qcom,ipq8074-qmp-pcie-phy":
+ - "pcie20_phy0_pipe_clk" Pipe Clock parent
+ (or)
+ "pcie20_phy1_pipe_clk"
- resets: a list of phandles and reset controller specifier pairs,
one for each entry in reset-names.
@@ -96,6 +106,7 @@ Example:
clocks = <&gcc GCC_PCIE_0_PIPE_CLK>;
clock-names = "pipe0";
+ clock-output-names = "pcie_0_pipe_clk_src";
resets = <&gcc GCC_PCIE_0_PHY_BCR>;
reset-names = "lane0";
};
diff --git a/Documentation/devicetree/bindings/phy/ralink-usb-phy.txt b/Documentation/devicetree/bindings/phy/ralink-usb-phy.txt
new file mode 100644
index 000000000000..9d2868a437ab
--- /dev/null
+++ b/Documentation/devicetree/bindings/phy/ralink-usb-phy.txt
@@ -0,0 +1,23 @@
+Mediatek/Ralink USB PHY
+
+Required properties:
+ - compatible: "ralink,rt3352-usbphy"
+ "mediatek,mt7620-usbphy"
+ "mediatek,mt7628-usbphy"
+ - reg: required for "mediatek,mt7628-usbphy", unused otherwise
+ - #phy-cells: should be 0
+ - ralink,sysctl: a phandle to a ralink syscon register region
+ - resets: the two reset controllers for host and device
+ - reset-names: the names of the 2 reset controllers
+
+Example:
+
+usbphy: phy {
+ compatible = "mediatek,mt7628-usbphy";
+ reg = <0x10120000 0x1000>;
+ #phy-cells = <0>;
+
+ ralink,sysctl = <&sysc>;
+ resets = <&rstctrl 22 &rstctrl 25>;
+ reset-names = "host", "device";
+};
diff --git a/Documentation/devicetree/bindings/phy/sun4i-usb-phy.txt b/Documentation/devicetree/bindings/phy/sun4i-usb-phy.txt
index 005bc22938ff..cbc7847dbf6c 100644
--- a/Documentation/devicetree/bindings/phy/sun4i-usb-phy.txt
+++ b/Documentation/devicetree/bindings/phy/sun4i-usb-phy.txt
@@ -9,6 +9,7 @@ Required properties:
* allwinner,sun7i-a20-usb-phy
* allwinner,sun8i-a23-usb-phy
* allwinner,sun8i-a33-usb-phy
+ * allwinner,sun8i-a83t-usb-phy
* allwinner,sun8i-h3-usb-phy
* allwinner,sun8i-v3s-usb-phy
* allwinner,sun50i-a64-usb-phy
@@ -17,18 +18,22 @@ Required properties:
* "phy_ctrl"
* "pmu0" for H3, V3s and A64
* "pmu1"
- * "pmu2" for sun4i, sun6i or sun7i
+ * "pmu2" for sun4i, sun6i, sun7i, sun8i-a83t or sun8i-h3
+ * "pmu3" for sun8i-h3
- #phy-cells : from the generic phy bindings, must be 1
- clocks : phandle + clock specifier for the phy clocks
- clock-names :
* "usb_phy" for sun4i, sun5i or sun7i
* "usb0_phy", "usb1_phy" and "usb2_phy" for sun6i
* "usb0_phy", "usb1_phy" for sun8i
+ * "usb0_phy", "usb1_phy", "usb2_phy" and "usb2_hsic_12M" for sun8i-a83t
+ * "usb0_phy", "usb1_phy", "usb2_phy" and "usb3_phy" for sun8i-h3
- resets : a list of phandle + reset specifier pairs
- reset-names :
* "usb0_reset"
* "usb1_reset"
- * "usb2_reset" for sun4i, sun6i or sun7i
+ * "usb2_reset" for sun4i, sun6i, sun7i, sun8i-a83t or sun8i-h3
+ * "usb3_reset" for sun8i-h3
Optional properties:
- usb0_id_det-gpios : gpio phandle for reading the otg id pin value
@@ -37,6 +42,7 @@ Optional properties:
- usb0_vbus-supply : regulator phandle for controller usb0 vbus
- usb1_vbus-supply : regulator phandle for controller usb1 vbus
- usb2_vbus-supply : regulator phandle for controller usb2 vbus
+- usb3_vbus-supply : regulator phandle for controller usb3 vbus
Example:
usbphy: phy@0x01c13400 {
diff --git a/Documentation/devicetree/bindings/pinctrl/cortina,gemini-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/cortina,gemini-pinctrl.txt
new file mode 100644
index 000000000000..61466c58faae
--- /dev/null
+++ b/Documentation/devicetree/bindings/pinctrl/cortina,gemini-pinctrl.txt
@@ -0,0 +1,59 @@
+Cortina Systems Gemini pin controller
+
+This pin controller is found in the Cortina Systems Gemini SoC family,
+see further arm/gemini.txt. It is a purely group-based multiplexing pin
+controller.
+
+The pin controller node must be a subnode of the system controller node.
+
+Required properties:
+- compatible: "cortina,gemini-pinctrl"
+
+Subnodes of the pin controller contain pin control multiplexing set-up.
+Please refer to pinctrl-bindings.txt for generic pin multiplexing nodes.
+
+Example:
+
+
+syscon {
+ compatible = "cortina,gemini-syscon";
+ ...
+ pinctrl {
+ compatible = "cortina,gemini-pinctrl";
+ pinctrl-names = "default";
+ pinctrl-0 = <&dram_default_pins>, <&system_default_pins>,
+ <&vcontrol_default_pins>;
+
+ dram_default_pins: pinctrl-dram {
+ mux {
+ function = "dram";
+ groups = "dramgrp";
+ };
+ };
+ rtc_default_pins: pinctrl-rtc {
+ mux {
+ function = "rtc";
+ groups = "rtcgrp";
+ };
+ };
+ power_default_pins: pinctrl-power {
+ mux {
+ function = "power";
+ groups = "powergrp";
+ };
+ };
+ system_default_pins: pinctrl-system {
+ mux {
+ function = "system";
+ groups = "systemgrp";
+ };
+ };
+ (...)
+ uart_default_pins: pinctrl-uart {
+ mux {
+ function = "uart";
+ groups = "uartrxtxgrp";
+ };
+ };
+ };
+};
diff --git a/Documentation/devicetree/bindings/pinctrl/fsl,imx7ulp-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/fsl,imx7ulp-pinctrl.txt
new file mode 100644
index 000000000000..44ad670ae11e
--- /dev/null
+++ b/Documentation/devicetree/bindings/pinctrl/fsl,imx7ulp-pinctrl.txt
@@ -0,0 +1,61 @@
+* Freescale i.MX7ULP IOMUX Controller
+
+i.MX 7ULP has three IOMUXC instances: IOMUXC0 for M4 ports, IOMUXC1 for A7
+ports and IOMUXC DDR for DDR interface.
+
+Note:
+This binding doc is only for the IOMUXC1 support in A7 Domain and it only
+supports generic pin config.
+
+Please also refer pinctrl-bindings.txt in this directory for generic pinctrl
+binding.
+
+=== Pin Controller Node ===
+
+Required properties:
+- compatible: "fsl,imx7ulp-iomuxc1"
+- reg: Should contain the base physical address and size of the iomuxc
+ registers.
+
+=== Pin Configuration Node ===
+- pinmux: One integers array, represents a group of pins mux setting.
+ The format is pinmux = <PIN_FUNC_ID>, PIN_FUNC_ID is a pin working on
+ a specific function.
+
+ NOTE: i.MX7ULP PIN_FUNC_ID consists of 4 integers as it shares one mux
+ and config register as follows:
+ <mux_conf_reg input_reg mux_mode input_val>
+
+ Refer to imx7ulp-pinfunc.h in in device tree source folder for all
+ available imx7ulp PIN_FUNC_ID.
+
+Optional Properties:
+- drive-strength Integer. Controls Drive Strength
+ 0: Standard
+ 1: Hi Driver
+- drive-push-pull Bool. Enable Pin Push-pull
+- drive-open-drain Bool. Enable Pin Open-drian
+- slew-rate: Integer. Controls Slew Rate
+ 0: Standard
+ 1: Slow
+- bias-disable: Bool. Pull disabled
+- bias-pull-down: Bool. Pull down on pin
+- bias-pull-up: Bool. Pull up on pin
+
+Examples:
+#include "imx7ulp-pinfunc.h"
+
+/* Pin Controller Node */
+iomuxc1: iomuxc@40ac0000 {
+ compatible = "fsl,imx7ulp-iomuxc1";
+ reg = <0x40ac0000 0x1000>;
+
+ /* Pin Configuration Node */
+ pinctrl_lpuart4: lpuart4grp {
+ pinmux = <
+ IMX7ULP_PAD_PTC3__LPUART4_RX
+ IMX7ULP_PAD_PTC2__LPUART4_TX
+ >;
+ bias-pull-up;
+ };
+};
diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-aspeed.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-aspeed.txt
index ca01710ee29a..3b7266c7c438 100644
--- a/Documentation/devicetree/bindings/pinctrl/pinctrl-aspeed.txt
+++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-aspeed.txt
@@ -69,8 +69,9 @@ PWM1 PWM2 PWM3 PWM4 PWM5 PWM6 PWM7 RGMII1 RGMII2 RMII1 RMII2 ROM16 ROM8 ROMCS1
ROMCS2 ROMCS3 ROMCS4 RXD1 RXD2 RXD3 RXD4 SALT1 SALT2 SALT3 SALT4 SD1 SD2 SGPMCK
SGPMI SGPMLD SGPMO SGPSCK SGPSI0 SGPSI1 SGPSLD SIOONCTRL SIOPBI SIOPBO SIOPWREQ
SIOPWRGD SIOS3 SIOS5 SIOSCI SPI1 SPI1DEBUG SPI1PASSTHRU SPICS1 TIMER3 TIMER4
-TIMER5 TIMER6 TIMER7 TIMER8 TXD1 TXD2 TXD3 TXD4 UART6 USBCKI VGABIOS_ROM VGAHS
-VGAVS VPI18 VPI24 VPI30 VPO12 VPO24 WDTRST1 WDTRST2
+TIMER5 TIMER6 TIMER7 TIMER8 TXD1 TXD2 TXD3 TXD4 UART6 USB11D1 USB11H2 USB2D1
+USB2H1 USBCKI VGABIOS_ROM VGAHS VGAVS VPI18 VPI24 VPI30 VPO12 VPO24 WDTRST1
+WDTRST2
aspeed,ast2500-pinctrl, aspeed,g5-pinctrl:
@@ -86,7 +87,8 @@ SALT11 SALT12 SALT13 SALT14 SALT2 SALT3 SALT4 SALT5 SALT6 SALT7 SALT8 SALT9
SCL1 SCL2 SD1 SD2 SDA1 SDA2 SGPS1 SGPS2 SIOONCTRL SIOPBI SIOPBO SIOPWREQ
SIOPWRGD SIOS3 SIOS5 SIOSCI SPI1 SPI1CS1 SPI1DEBUG SPI1PASSTHRU SPI2CK SPI2CS0
SPI2CS1 SPI2MISO SPI2MOSI TIMER3 TIMER4 TIMER5 TIMER6 TIMER7 TIMER8 TXD1 TXD2
-TXD3 TXD4 UART6 USBCKI VGABIOSROM VGAHS VGAVS VPI24 VPO WDTRST1 WDTRST2
+TXD3 TXD4 UART6 USB11BHID USB2AD USB2AH USB2BD USB2BH USBCKI VGABIOSROM VGAHS
+VGAVS VPI24 VPO WDTRST1 WDTRST2
Examples
========
diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt
index 62d0f33fa65e..4483cc31e531 100644
--- a/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt
+++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-bindings.txt
@@ -268,6 +268,8 @@ output-enable - enable output on a pin without actively driving it
(such as enabling an output buffer)
output-low - set the pin to output mode with low level
output-high - set the pin to output mode with high level
+sleep-hardware-state - indicate this is sleep related state which will be programmed
+ into the registers for the sleep state.
slew-rate - set the slew rate
For example:
diff --git a/Documentation/devicetree/bindings/pinctrl/pinctrl-mt65xx.txt b/Documentation/devicetree/bindings/pinctrl/pinctrl-mt65xx.txt
index 17631d0a9af7..37d744750579 100644
--- a/Documentation/devicetree/bindings/pinctrl/pinctrl-mt65xx.txt
+++ b/Documentation/devicetree/bindings/pinctrl/pinctrl-mt65xx.txt
@@ -5,6 +5,7 @@ The Mediatek's Pin controller is used to control SoC pins.
Required properties:
- compatible: value should be one of the following.
"mediatek,mt2701-pinctrl", compatible with mt2701 pinctrl.
+ "mediatek,mt2712-pinctrl", compatible with mt2712 pinctrl.
"mediatek,mt6397-pinctrl", compatible with mt6397 pinctrl.
"mediatek,mt7623-pinctrl", compatible with mt7623 pinctrl.
"mediatek,mt8127-pinctrl", compatible with mt8127 pinctrl.
diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,apq8064-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/qcom,apq8064-pinctrl.txt
index a7bde64798c7..a752a4716486 100644
--- a/Documentation/devicetree/bindings/pinctrl/qcom,apq8064-pinctrl.txt
+++ b/Documentation/devicetree/bindings/pinctrl/qcom,apq8064-pinctrl.txt
@@ -46,7 +46,8 @@ Valid values for pins are:
gpio0-gpio89
Valid values for function are:
- cam_mclk, codec_mic_i2s, codec_spkr_i2s, gpio, gsbi1, gsbi2, gsbi3, gsbi4,
+ cam_mclk, codec_mic_i2s, codec_spkr_i2s, gp_clk_0a, gp_clk_0b, gp_clk_1a,
+ gp_clk_1b, gp_clk_2a, gp_clk_2b, gpio, gsbi1, gsbi2, gsbi3, gsbi4,
gsbi4_cam_i2c, gsbi5, gsbi5_spi_cs1, gsbi5_spi_cs2, gsbi5_spi_cs3, gsbi6,
gsbi6_spi_cs1, gsbi6_spi_cs2, gsbi6_spi_cs3, gsbi7, gsbi7_spi_cs1,
gsbi7_spi_cs2, gsbi7_spi_cs3, gsbi_cam_i2c, hdmi, mi2s, riva_bt, riva_fm,
diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,ipq4019-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/qcom,ipq4019-pinctrl.txt
index cfb8500dd56b..93374f478b9e 100644
--- a/Documentation/devicetree/bindings/pinctrl/qcom,ipq4019-pinctrl.txt
+++ b/Documentation/devicetree/bindings/pinctrl/qcom,ipq4019-pinctrl.txt
@@ -50,7 +50,11 @@ Valid values for qcom,pins are:
Supports mux, bias and drive-strength
Valid values for qcom,function are:
-gpio, blsp_uart1, blsp_i2c0, blsp_i2c1, blsp_uart0, blsp_spi1, blsp_spi0
+aud_pin, audio_pwm, blsp_i2c0, blsp_i2c1, blsp_spi0, blsp_spi1, blsp_uart0,
+blsp_uart1, chip_rst, gpio, i2s_rx, i2s_spdif_in, i2s_spdif_out, i2s_td, i2s_tx,
+jtag, led0, led1, led2, led3, led4, led5, led6, led7, led8, led9, led10, led11,
+mdc, mdio, pcie, pmu, prng_rosc, qpic, rgmii, rmii, sdio, smart0, smart1,
+smart2, smart3, tm, wifi0, wifi1
Example:
diff --git a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt
index 8d893a874634..5b12c57e7f02 100644
--- a/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt
+++ b/Documentation/devicetree/bindings/pinctrl/qcom,pmic-gpio.txt
@@ -16,6 +16,7 @@ PMIC's from Qualcomm.
"qcom,pm8941-gpio"
"qcom,pm8994-gpio"
"qcom,pma8084-gpio"
+ "qcom,pmi8994-gpio"
And must contain either "qcom,spmi-gpio" or "qcom,ssbi-gpio"
if the device is on an spmi bus or an ssbi bus respectively
@@ -85,6 +86,7 @@ to specify in a pin configuration subnode:
gpio1-gpio36 for pm8941
gpio1-gpio22 for pm8994
gpio1-gpio22 for pma8084
+ gpio1-gpio10 for pmi8994
- function:
Usage: required
@@ -98,7 +100,10 @@ to specify in a pin configuration subnode:
"dtest1",
"dtest2",
"dtest3",
- "dtest4"
+ "dtest4",
+ And following values are supported by LV/MV GPIO subtypes:
+ "func3",
+ "func4"
- bias-disable:
Usage: optional
@@ -183,6 +188,25 @@ to specify in a pin configuration subnode:
Value type: <none>
Definition: The specified pins are configured in open-source mode.
+- qcom,analog-pass:
+ Usage: optional
+ Value type: <none>
+ Definition: The specified pins are configured in analog-pass-through mode.
+
+- qcom,atest:
+ Usage: optional
+ Value type: <u32>
+ Definition: Selects ATEST rail to route to GPIO when it's configured
+ in analog-pass-through mode.
+ Valid values are 1-4 corresponding to ATEST1 to ATEST4.
+
+- qcom,dtest-buffer:
+ Usage: optional
+ Value type: <u32>
+ Definition: Selects DTEST rail to route to GPIO when it's configured
+ as digital input.
+ Valid values are 1-4 corresponding to DTEST1 to DTEST4.
+
Example:
pm8921_gpio: gpio@150 {
diff --git a/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt
index 645082f03259..f4d127df980d 100644
--- a/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt
+++ b/Documentation/devicetree/bindings/pinctrl/renesas,pfc-pinctrl.txt
@@ -24,6 +24,7 @@ Required Properties:
- "renesas,pfc-r8a7794": for R8A7794 (R-Car E2) compatible pin-controller.
- "renesas,pfc-r8a7795": for R8A7795 (R-Car H3) compatible pin-controller.
- "renesas,pfc-r8a7796": for R8A7796 (R-Car M3-W) compatible pin-controller.
+ - "renesas,pfc-r8a77995": for R8A77995 (R-Car D3) compatible pin-controller.
- "renesas,pfc-sh73a0": for SH73A0 (SH-Mobile AG5) compatible pin-controller.
- reg: Base address and length of each memory resource used by the pin
diff --git a/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt
index ee01ab58224d..58b7921b4fed 100644
--- a/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt
+++ b/Documentation/devicetree/bindings/pinctrl/rockchip,pinctrl.txt
@@ -24,6 +24,7 @@ Required properties for iomux controller:
"rockchip,rk2928-pinctrl": for Rockchip RK2928
"rockchip,rk3066a-pinctrl": for Rockchip RK3066a
"rockchip,rk3066b-pinctrl": for Rockchip RK3066b
+ "rockchip,rk3128-pinctrl": for Rockchip RK3128
"rockchip,rk3188-pinctrl": for Rockchip RK3188
"rockchip,rk3228-pinctrl": for Rockchip RK3228
"rockchip,rk3288-pinctrl": for Rockchip RK3288
diff --git a/Documentation/devicetree/bindings/pinctrl/sprd,pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/sprd,pinctrl.txt
new file mode 100644
index 000000000000..b1cea7a3a071
--- /dev/null
+++ b/Documentation/devicetree/bindings/pinctrl/sprd,pinctrl.txt
@@ -0,0 +1,83 @@
+* Spreadtrum Pin Controller
+
+The Spreadtrum pin controller are organized in 3 blocks (types).
+
+The first block comprises some global control registers, and each
+register contains several bit fields with one bit or several bits
+to configure for some global common configuration, such as domain
+pad driving level, system control select and so on ("domain pad
+driving level": One pin can output 3.0v or 1.8v, depending on the
+related domain pad driving selection, if the related domain pad
+slect 3.0v, then the pin can output 3.0v. "system control" is used
+to choose one function (like: UART0) for which system, since we
+have several systems (AP/CP/CM4) on one SoC.).
+
+There are too much various configuration that we can not list all
+of them, so we can not make every Spreadtrum-special configuration
+as one generic configuration, and maybe it will add more strange
+global configuration in future. Then we add one "sprd,control" to
+set these various global control configuration, and we need use
+magic number for this property.
+
+Moreover we recognise every fields comprising one bit or several
+bits in one global control register as one pin, thus we should
+record every pin's bit offset, bit width and register offset to
+configure this field (pin).
+
+The second block comprises some common registers which have unified
+register definition, and each register described one pin is used
+to configure the pin sleep mode, function select and sleep related
+configuration.
+
+Now we have 4 systems for sleep mode on SC9860 SoC: AP system,
+PUBCP system, TGLDSP system and AGDSP system. And the pin sleep
+related configuration are:
+- input-enable
+- input-disable
+- output-high
+- output-low
+- bias-pull-up
+- bias-pull-down
+
+In some situation we need set the pin sleep mode and pin sleep related
+configuration, to set the pin sleep related configuration automatically
+by hardware when the system specified by sleep mode goes into deep
+sleep mode. For example, if we set the pin sleep mode as PUBCP_SLEEP
+and set the pin sleep related configuration as "input-enable", which
+means when PUBCP system goes into deep sleep mode, this pin will be set
+input enable automatically.
+
+Moreover we can not use the "sleep" state, since some systems (like:
+PUBCP system) do not run linux kernel OS (only AP system run linux
+kernel on SC9860 platform), then we can not select "sleep" state
+when the PUBCP system goes into deep sleep mode. Thus we introduce
+"sprd,sleep-mode" property to set pin sleep mode.
+
+The last block comprises some misc registers which also have unified
+register definition, and each register described one pin is used to
+configure drive strength, pull up/down and so on. Especially for pull
+up, we have two kind pull up resistor: 20K and 4.7K.
+
+Required properties for Spreadtrum pin controller:
+- compatible: "sprd,<soc>-pinctrl"
+ Please refer to each sprd,<soc>-pinctrl.txt binding doc for supported SoCs.
+- reg: The register address of pin controller device.
+- pins : An array of pin names.
+
+Optional properties:
+- function: Specified the function name.
+- drive-strength: Drive strength in mA.
+- input-schmitt-disable: Enable schmitt-trigger mode.
+- input-schmitt-enable: Disable schmitt-trigger mode.
+- bias-disable: Disable pin bias.
+- bias-pull-down: Pull down on pin.
+- bias-pull-up: Pull up on pin.
+- input-enable: Enable pin input.
+- input-disable: Enable pin output.
+- output-high: Set the pin as an output level high.
+- output-low: Set the pin as an output level low.
+- sleep-hardware-state: Indicate these configs in this state are sleep related.
+- sprd,control: Control values referring to databook for global control pins.
+- sprd,sleep-mode: Sleep mode selection.
+
+Please refer to each sprd,<soc>-pinctrl.txt binding doc for supported values.
diff --git a/Documentation/devicetree/bindings/pinctrl/sprd,sc9860-pinctrl.txt b/Documentation/devicetree/bindings/pinctrl/sprd,sc9860-pinctrl.txt
new file mode 100644
index 000000000000..5a628333d52f
--- /dev/null
+++ b/Documentation/devicetree/bindings/pinctrl/sprd,sc9860-pinctrl.txt
@@ -0,0 +1,70 @@
+* Spreadtrum SC9860 Pin Controller
+
+Please refer to sprd,pinctrl.txt in this directory for common binding part
+and usage.
+
+Required properties:
+- compatible: Must be "sprd,sc9860-pinctrl".
+- reg: The register address of pin controller device.
+- pins : An array of strings, each string containing the name of a pin.
+
+Optional properties:
+- function: A string containing the name of the function, values must be
+ one of: "func1", "func2", "func3" and "func4".
+- drive-strength: Drive strength in mA. Supported values: 2, 4, 6, 8, 10,
+ 12, 14, 16, 20, 21, 24, 25, 27, 29, 31 and 33.
+- input-schmitt-disable: Enable schmitt-trigger mode.
+- input-schmitt-enable: Disable schmitt-trigger mode.
+- bias-disable: Disable pin bias.
+- bias-pull-down: Pull down on pin.
+- bias-pull-up: Pull up on pin. Supported values: 20000 for pull-up resistor
+ is 20K and 4700 for pull-up resistor is 4.7K.
+- input-enable: Enable pin input.
+- input-disable: Enable pin output.
+- output-high: Set the pin as an output level high.
+- output-low: Set the pin as an output level low.
+- sleep-hardware-state: Indicate these configs in this state are sleep related.
+- sprd,control: Control values referring to databook for global control pins.
+- sprd,sleep-mode: Choose the pin sleep mode, and supported values are:
+ AP_SLEEP, PUBCP_SLEEP, TGLDSP_SLEEP and AGDSP_SLEEP.
+
+Pin sleep mode definition:
+enum pin_sleep_mode {
+ AP_SLEEP = BIT(0),
+ PUBCP_SLEEP = BIT(1),
+ TGLDSP_SLEEP = BIT(2),
+ AGDSP_SLEEP = BIT(3),
+};
+
+Example:
+pin_controller: pinctrl@402a0000 {
+ compatible = "sprd,sc9860-pinctrl";
+ reg = <0x402a0000 0x10000>;
+
+ grp1: sd0 {
+ pins = "SC9860_VIO_SD2_IRTE", "SC9860_VIO_SD0_IRTE";
+ sprd,control = <0x1>;
+ };
+
+ grp2: rfctl_33 {
+ pins = "SC9860_RFCTL33";
+ function = "func2";
+ sprd,sleep-mode = <AP_SLEEP | PUBCP_SLEEP>;
+ grp2_sleep_mode: rfctl_33_sleep {
+ pins = "SC9860_RFCTL33";
+ sleep-hardware-state;
+ output-low;
+ }
+ };
+
+ grp3: rfctl_misc_20 {
+ pins = "SC9860_RFCTL20_MISC";
+ drive-strength = <10>;
+ bias-pull-up = <4700>;
+ grp3_sleep_mode: rfctl_misc_sleep {
+ pins = "SC9860_RFCTL20_MISC";
+ sleep-hardware-state;
+ bias-pull-up;
+ }
+ };
+};
diff --git a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt
index 43c21fb04564..4a4766e9c254 100644
--- a/Documentation/devicetree/bindings/power/rockchip-io-domain.txt
+++ b/Documentation/devicetree/bindings/power/rockchip-io-domain.txt
@@ -39,6 +39,8 @@ Required properties:
- "rockchip,rk3368-pmu-io-voltage-domain" for rk3368 pmu-domains
- "rockchip,rk3399-io-voltage-domain" for rk3399
- "rockchip,rk3399-pmu-io-voltage-domain" for rk3399 pmu-domains
+ - "rockchip,rv1108-io-voltage-domain" for rv1108
+ - "rockchip,rv1108-pmu-io-voltage-domain" for rv1108 pmu-domains
Deprecated properties:
- rockchip,grf: phandle to the syscon managing the "general register files"
diff --git a/Documentation/devicetree/bindings/regulator/mt6311-regulator.txt b/Documentation/devicetree/bindings/regulator/mt6311-regulator.txt
index 02649d8b3f5a..84d544d8c1b1 100644
--- a/Documentation/devicetree/bindings/regulator/mt6311-regulator.txt
+++ b/Documentation/devicetree/bindings/regulator/mt6311-regulator.txt
@@ -1,4 +1,4 @@
-Mediatek MT6311 Regulator Driver
+Mediatek MT6311 Regulator
Required properties:
- compatible: "mediatek,mt6311-regulator"
diff --git a/Documentation/devicetree/bindings/regulator/mt6323-regulator.txt b/Documentation/devicetree/bindings/regulator/mt6323-regulator.txt
index c35d878b0960..a48749db4df3 100644
--- a/Documentation/devicetree/bindings/regulator/mt6323-regulator.txt
+++ b/Documentation/devicetree/bindings/regulator/mt6323-regulator.txt
@@ -1,4 +1,4 @@
-Mediatek MT6323 Regulator Driver
+Mediatek MT6323 Regulator
All voltage regulators are defined as subnodes of the regulators node. A list
of regulators provided by this controller are defined as subnodes of the
diff --git a/Documentation/devicetree/bindings/regulator/mt6380-regulator.txt b/Documentation/devicetree/bindings/regulator/mt6380-regulator.txt
new file mode 100644
index 000000000000..0058441f16d2
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/mt6380-regulator.txt
@@ -0,0 +1,89 @@
+MediaTek MT6380 Regulator
+
+All voltage regulators provided by the MT6380 PMIC are described as the
+subnodes of the MT6380 regulators node. Each regulator is named according
+to its regulator type, buck-<name> and ldo-<name>. The definition for each
+of these nodes is defined using the standard binding for regulators at
+Documentation/devicetree/bindings/regulator/regulator.txt.
+
+The valid names for regulators are:
+BUCK:
+ buck-core1, buck-vcore, buck-vrf
+LDO:
+ ldo-vm ,ldo-va , ldo-vphy, ldo-vddr, ldo-vt
+
+Example:
+
+ regulators {
+ compatible = "mediatek,mt6380-regulator";
+
+ mt6380_vcpu_reg: buck-vcore1 {
+ regulator-name = "vcore1";
+ regulator-min-microvolt = < 600000>;
+ regulator-max-microvolt = <1393750>;
+ regulator-ramp-delay = <6250>;
+ regulator-always-on;
+ regulator-boot-on;
+ };
+
+ mt6380_vcore_reg: buck-vcore {
+ regulator-name = "vcore";
+ regulator-min-microvolt = <600000>;
+ regulator-max-microvolt = <1393750>;
+ regulator-ramp-delay = <6250>;
+ };
+
+ mt6380_vrf_reg: buck-vrf {
+ regulator-name = "vrf";
+ regulator-min-microvolt = <1200000>;
+ regulator-max-microvolt = <1575000>;
+ regulator-ramp-delay = <0>;
+ regulator-always-on;
+ regulator-boot-on;
+ };
+
+ mt6380_vm_reg: ldo-vm {
+ regulator-name = "vm";
+ regulator-min-microvolt = <1050000>;
+ regulator-max-microvolt = <1400000>;
+ regulator-ramp-delay = <0>;
+ regulator-always-on;
+ regulator-boot-on;
+ };
+
+ mt6380_va_reg: ldo-va {
+ regulator-name = "va";
+ regulator-min-microvolt = <2200000>;
+ regulator-max-microvolt = <3300000>;
+ regulator-ramp-delay = <0>;
+ regulator-always-on;
+ regulator-boot-on;
+ };
+
+ mt6380_vphy_reg: ldo-vphy {
+ regulator-name = "vphy";
+ regulator-min-microvolt = <1800000>;
+ regulator-max-microvolt = <1800000>;
+ regulator-ramp-delay = <0>;
+ regulator-always-on;
+ regulator-boot-on;
+ };
+
+ mt6380_vddr_reg: ldo-vddr {
+ regulator-name = "vddr";
+ regulator-min-microvolt = <1240000>;
+ regulator-max-microvolt = <1840000>;
+ regulator-ramp-delay = <0>;
+ regulator-always-on;
+ regulator-boot-on;
+ };
+
+ mt6380_vt_reg: ldo-vt {
+ regulator-name = "vt";
+ regulator-min-microvolt = <2200000>;
+ regulator-max-microvolt = <3300000>;
+ regulator-ramp-delay = <0>;
+ regulator-always-on;
+ regulator-boot-on;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/regulator/mt6397-regulator.txt b/Documentation/devicetree/bindings/regulator/mt6397-regulator.txt
index a42b1d6e9863..01141fb00875 100644
--- a/Documentation/devicetree/bindings/regulator/mt6397-regulator.txt
+++ b/Documentation/devicetree/bindings/regulator/mt6397-regulator.txt
@@ -1,4 +1,4 @@
-Mediatek MT6397 Regulator Driver
+Mediatek MT6397 Regulator
Required properties:
- compatible: "mediatek,mt6397-regulator"
diff --git a/Documentation/devicetree/bindings/regulator/pwm-regulator.txt b/Documentation/devicetree/bindings/regulator/pwm-regulator.txt
index bf85aa9ad6a7..3d78d507e29f 100644
--- a/Documentation/devicetree/bindings/regulator/pwm-regulator.txt
+++ b/Documentation/devicetree/bindings/regulator/pwm-regulator.txt
@@ -71,7 +71,7 @@ Continuous Voltage With Enable GPIO Example:
* Inverted PWM logic, and the duty cycle range is limited
* to 30%-70%.
*/
- pwm-dutycycle-range <700 300>; /* */
+ pwm-dutycycle-range = <700 300>; /* */
};
Voltage Table Example:
diff --git a/Documentation/devicetree/bindings/regulator/st,stm32-vrefbuf.txt b/Documentation/devicetree/bindings/regulator/st,stm32-vrefbuf.txt
new file mode 100644
index 000000000000..3944ee3e731e
--- /dev/null
+++ b/Documentation/devicetree/bindings/regulator/st,stm32-vrefbuf.txt
@@ -0,0 +1,20 @@
+STM32 VREFBUF - Voltage reference buffer
+
+Some STM32 devices embed a voltage reference buffer which can be used as
+voltage reference for ADCs, DACs and also as voltage reference for external
+components through the dedicated VREF+ pin.
+
+Required properties:
+- compatible: Must be "st,stm32-vrefbuf".
+- reg: Offset and length of VREFBUF register set.
+- clocks: Must contain an entry for peripheral clock.
+
+Example:
+ vrefbuf: regulator@58003C00 {
+ compatible = "st,stm32-vrefbuf";
+ reg = <0x58003C00 0x8>;
+ clocks = <&rcc VREF_CK>;
+ regulator-min-microvolt = <1500000>;
+ regulator-max-microvolt = <2500000>;
+ vdda-supply = <&vdda>;
+ };
diff --git a/Documentation/devicetree/bindings/rng/imx-rngc.txt b/Documentation/devicetree/bindings/rng/imx-rngc.txt
new file mode 100644
index 000000000000..93c7174a7bed
--- /dev/null
+++ b/Documentation/devicetree/bindings/rng/imx-rngc.txt
@@ -0,0 +1,21 @@
+Freescale RNGC (Random Number Generator Version C)
+
+The driver also supports version B, which is mostly compatible
+to version C.
+
+Required properties:
+- compatible : should be one of
+ "fsl,imx25-rngb"
+ "fsl,imx35-rngc"
+- reg : offset and length of the register set of this block
+- interrupts : the interrupt number for the RNGC block
+- clocks : the RNGC clk source
+
+Example:
+
+rng@53fb0000 {
+ compatible = "fsl,imx25-rngb";
+ reg = <0x53fb0000 0x4000>;
+ interrupts = <22>;
+ clocks = <&trng_clk>;
+};
diff --git a/Documentation/devicetree/bindings/serial/8250.txt b/Documentation/devicetree/bindings/serial/8250.txt
index 419ff6c0a47f..dad3b2ec66d4 100644
--- a/Documentation/devicetree/bindings/serial/8250.txt
+++ b/Documentation/devicetree/bindings/serial/8250.txt
@@ -14,6 +14,8 @@ Required properties:
tegra132, or tegra210.
- "nxp,lpc3220-uart"
- "ralink,rt2880-uart"
+ - For MediaTek BTIF, must contain '"mediatek,<chip>-btif",
+ "mediatek,mtk-btif"' where <chip> is mt7622, mt7623.
- "altr,16550-FIFO32"
- "altr,16550-FIFO64"
- "altr,16550-FIFO128"
diff --git a/Documentation/devicetree/bindings/serial/renesas,sci-serial.txt b/Documentation/devicetree/bindings/serial/renesas,sci-serial.txt
index 8d27d1a603e7..4fc96946f81d 100644
--- a/Documentation/devicetree/bindings/serial/renesas,sci-serial.txt
+++ b/Documentation/devicetree/bindings/serial/renesas,sci-serial.txt
@@ -41,6 +41,8 @@ Required properties:
- "renesas,hscif-r8a7795" for R8A7795 (R-Car H3) HSCIF compatible UART.
- "renesas,scif-r8a7796" for R8A7796 (R-Car M3-W) SCIF compatible UART.
- "renesas,hscif-r8a7796" for R8A7796 (R-Car M3-W) HSCIF compatible UART.
+ - "renesas,scif-r8a77995" for R8A77995 (R-Car D3) SCIF compatible UART.
+ - "renesas,hscif-r8a77995" for R8A77995 (R-Car D3) HSCIF compatible UART.
- "renesas,scifa-sh73a0" for SH73A0 (SH-Mobile AG5) SCIFA compatible UART.
- "renesas,scifb-sh73a0" for SH73A0 (SH-Mobile AG5) SCIFB compatible UART.
- "renesas,rcar-gen1-scif" for R-Car Gen1 SCIF compatible UART,
diff --git a/Documentation/devicetree/bindings/serial/rs485.txt b/Documentation/devicetree/bindings/serial/rs485.txt
index 32b1fa1f2a5b..b8415936dfdb 100644
--- a/Documentation/devicetree/bindings/serial/rs485.txt
+++ b/Documentation/devicetree/bindings/serial/rs485.txt
@@ -5,14 +5,13 @@ the built-in half-duplex mode.
The properties described hereafter shall be given to a half-duplex capable
UART node.
-Required properties:
+Optional properties:
- rs485-rts-delay: prop-encoded-array <a b> where:
* a is the delay between rts signal and beginning of data sent in milliseconds.
it corresponds to the delay before sending data.
* b is the delay between end of data sent and rts signal in milliseconds
it corresponds to the delay after sending data and actual release of the line.
-
-Optional properties:
+ If this property is not specified, <0 0> is assumed.
- linux,rs485-enabled-at-boot-time: empty property telling to enable the rs485
feature at boot time. It can be disabled later with proper ioctl.
- rs485-rx-during-tx: empty property that enables the receiving of data even
diff --git a/Documentation/devicetree/bindings/serial/st,stm32-usart.txt b/Documentation/devicetree/bindings/serial/st,stm32-usart.txt
index 85ec5f2b1996..3657f9f9d17a 100644
--- a/Documentation/devicetree/bindings/serial/st,stm32-usart.txt
+++ b/Documentation/devicetree/bindings/serial/st,stm32-usart.txt
@@ -1,12 +1,19 @@
* STMicroelectronics STM32 USART
Required properties:
-- compatible: Can be either "st,stm32-usart", "st,stm32-uart",
-"st,stm32f7-usart" or "st,stm32f7-uart" depending on whether
-the device supports synchronous mode and is compatible with
-stm32(f4) or stm32f7.
+- compatible: can be either:
+ - "st,stm32-usart",
+ - "st,stm32-uart",
+ - "st,stm32f7-usart",
+ - "st,stm32f7-uart",
+ - "st,stm32h7-usart"
+ - "st,stm32h7-uart".
+ depending on whether the device supports synchronous mode
+ and is compatible with stm32(f4), stm32f7 or stm32h7.
- reg: The address and length of the peripheral registers space
-- interrupts: The interrupt line of the USART instance
+- interrupts:
+ - The interrupt line for the USART instance,
+ - An optional wake-up interrupt.
- clocks: The input clock of the USART instance
Optional properties:
diff --git a/Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt b/Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt
index 31b5b21598ff..5bf13960f7f4 100644
--- a/Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt
+++ b/Documentation/devicetree/bindings/spi/fsl-imx-cspi.txt
@@ -9,6 +9,7 @@ Required properties:
- "fsl,imx31-cspi" for SPI compatible with the one integrated on i.MX31
- "fsl,imx35-cspi" for SPI compatible with the one integrated on i.MX35
- "fsl,imx51-ecspi" for SPI compatible with the one integrated on i.MX51
+ - "fsl,imx53-ecspi" for SPI compatible with the one integrated on i.MX53 and later Soc
- reg : Offset and length of the register set for the device
- interrupts : Should contain CSPI/eCSPI interrupt
- cs-gpios : Specifies the gpio pins to be used for chipselects.
diff --git a/Documentation/devicetree/bindings/spi/sh-msiof.txt b/Documentation/devicetree/bindings/spi/sh-msiof.txt
index 64ee489571c4..39e5ef7c5e71 100644
--- a/Documentation/devicetree/bindings/spi/sh-msiof.txt
+++ b/Documentation/devicetree/bindings/spi/sh-msiof.txt
@@ -6,6 +6,7 @@ Required properties:
"renesas,msiof-r8a7792" (R-Car V2H)
"renesas,msiof-r8a7793" (R-Car M2-N)
"renesas,msiof-r8a7794" (R-Car E2)
+ "renesas,msiof-r8a7795" (R-Car H3)
"renesas,msiof-r8a7796" (R-Car M3-W)
"renesas,msiof-sh73a0" (SH-Mobile AG5)
"renesas,sh-mobile-msiof" (generic SH-Mobile compatibile device)
diff --git a/Documentation/devicetree/bindings/spi/spi-rockchip.txt b/Documentation/devicetree/bindings/spi/spi-rockchip.txt
index 83da4931d832..6e3ffacbba32 100644
--- a/Documentation/devicetree/bindings/spi/spi-rockchip.txt
+++ b/Documentation/devicetree/bindings/spi/spi-rockchip.txt
@@ -6,6 +6,7 @@ and display controllers using the SPI communication interface.
Required Properties:
- compatible: should be one of the following.
+ "rockchip,rv1108-spi" for rv1108 SoCs.
"rockchip,rk3036-spi" for rk3036 SoCS.
"rockchip,rk3066-spi" for rk3066 SoCs.
"rockchip,rk3188-spi" for rk3188 SoCs.
diff --git a/Documentation/devicetree/bindings/timer/nxp,tpm-timer.txt b/Documentation/devicetree/bindings/timer/nxp,tpm-timer.txt
new file mode 100644
index 000000000000..b4aa7ddb5b13
--- /dev/null
+++ b/Documentation/devicetree/bindings/timer/nxp,tpm-timer.txt
@@ -0,0 +1,28 @@
+NXP Low Power Timer/Pulse Width Modulation Module (TPM)
+
+The Timer/PWM Module (TPM) supports input capture, output compare,
+and the generation of PWM signals to control electric motor and power
+management applications. The counter, compare and capture registers
+are clocked by an asynchronous clock that can remain enabled in low
+power modes. TPM can support global counter bus where one TPM drives
+the counter bus for the others, provided bit width is the same.
+
+Required properties:
+
+- compatible : should be "fsl,imx7ulp-tpm"
+- reg : Specifies base physical address and size of the register sets
+ for the clock event device and clock source device.
+- interrupts : Should be the clock event device interrupt.
+- clocks : The clocks provided by the SoC to drive the timer, must contain
+ an entry for each entry in clock-names.
+- clock-names : Must include the following entries: "igp" and "per".
+
+Example:
+tpm5: tpm@40260000 {
+ compatible = "fsl,imx7ulp-tpm";
+ reg = <0x40260000 0x1000>;
+ interrupts = <GIC_SPI 22 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&clks IMX7ULP_CLK_NIC1_BUS_DIV>,
+ <&clks IMX7ULP_CLK_LPTPM5>;
+ clock-names = "ipg", "per";
+};
diff --git a/Documentation/devicetree/bindings/timer/renesas,cmt.txt b/Documentation/devicetree/bindings/timer/renesas,cmt.txt
index 1a05c1b243c1..6ca6b9e582a0 100644
--- a/Documentation/devicetree/bindings/timer/renesas,cmt.txt
+++ b/Documentation/devicetree/bindings/timer/renesas,cmt.txt
@@ -12,46 +12,29 @@ datasheets.
Required Properties:
- compatible: must contain one or more of the following:
- - "renesas,cmt-32-r8a7740" for the r8a7740 32-bit CMT
- (CMT0)
- - "renesas,cmt-32-sh7372" for the sh7372 32-bit CMT
- (CMT0)
- - "renesas,cmt-32-sh73a0" for the sh73a0 32-bit CMT
- (CMT0)
- - "renesas,cmt-32" for all 32-bit CMT without fast clock support
- (CMT0 on sh7372, sh73a0 and r8a7740)
- This is a fallback for the above renesas,cmt-32-* entries.
-
- - "renesas,cmt-32-fast-r8a7740" for the r8a7740 32-bit CMT with fast
- clock support (CMT[234])
- - "renesas,cmt-32-fast-sh7372" for the sh7372 32-bit CMT with fast
- clock support (CMT[234])
- - "renesas,cmt-32-fast-sh73a0" for the sh73A0 32-bit CMT with fast
- clock support (CMT[234])
- - "renesas,cmt-32-fast" for all 32-bit CMT with fast clock support
- (CMT[234] on sh7372, sh73a0 and r8a7740)
- This is a fallback for the above renesas,cmt-32-fast-* entries.
-
- - "renesas,cmt-48-sh7372" for the sh7372 48-bit CMT
- (CMT1)
- "renesas,cmt-48-sh73a0" for the sh73A0 48-bit CMT
(CMT1)
- "renesas,cmt-48-r8a7740" for the r8a7740 48-bit CMT
(CMT1)
- "renesas,cmt-48" for all non-second generation 48-bit CMT
- (CMT1 on sh7372, sh73a0 and r8a7740)
+ (CMT1 on sh73a0 and r8a7740)
This is a fallback for the above renesas,cmt-48-* entries.
- - "renesas,cmt-48-r8a73a4" for the r8a73a4 48-bit CMT
- (CMT[01])
- - "renesas,cmt-48-r8a7790" for the r8a7790 48-bit CMT
- (CMT[01])
- - "renesas,cmt-48-r8a7791" for the r8a7791 48-bit CMT
- (CMT[01])
- - "renesas,cmt-48-gen2" for all second generation 48-bit CMT
- (CMT[01] on r8a73a4, r8a7790 and r8a7791)
- This is a fallback for the renesas,cmt-48-r8a73a4,
- renesas,cmt-48-r8a7790 and renesas,cmt-48-r8a7791 entries.
+ - "renesas,cmt0-r8a73a4" for the 32-bit CMT0 device included in r8a73a4.
+ - "renesas,cmt1-r8a73a4" for the 48-bit CMT1 device included in r8a73a4.
+ - "renesas,cmt0-r8a7790" for the 32-bit CMT0 device included in r8a7790.
+ - "renesas,cmt1-r8a7790" for the 48-bit CMT1 device included in r8a7790.
+ - "renesas,cmt0-r8a7791" for the 32-bit CMT0 device included in r8a7791.
+ - "renesas,cmt1-r8a7791" for the 48-bit CMT1 device included in r8a7791.
+ - "renesas,cmt0-r8a7793" for the 32-bit CMT0 device included in r8a7793.
+ - "renesas,cmt1-r8a7793" for the 48-bit CMT1 device included in r8a7793.
+ - "renesas,cmt0-r8a7794" for the 32-bit CMT0 device included in r8a7794.
+ - "renesas,cmt1-r8a7794" for the 48-bit CMT1 device included in r8a7794.
+
+ - "renesas,rcar-gen2-cmt0" for 32-bit CMT0 devices included in R-Car Gen2.
+ - "renesas,rcar-gen2-cmt1" for 48-bit CMT1 devices included in R-Car Gen2.
+ These are fallbacks for r8a73a4 and all the R-Car Gen2
+ entries listed above.
- reg: base address and length of the registers block for the timer module.
- interrupts: interrupt-specifier for the timer, one per channel.
@@ -59,21 +42,29 @@ Required Properties:
in clock-names.
- clock-names: must contain "fck" for the functional clock.
- - renesas,channels-mask: bitmask of the available channels.
-
-Example: R8A7790 (R-Car H2) CMT0 node
-
- CMT0 on R8A7790 implements hardware channels 5 and 6 only and names
- them channels 0 and 1 in the documentation.
+Example: R8A7790 (R-Car H2) CMT0 and CMT1 nodes
cmt0: timer@ffca0000 {
- compatible = "renesas,cmt-48-r8a7790", "renesas,cmt-48-gen2";
+ compatible = "renesas,cmt0-r8a7790", "renesas,rcar-gen2-cmt0";
reg = <0 0xffca0000 0 0x1004>;
interrupts = <0 142 IRQ_TYPE_LEVEL_HIGH>,
<0 142 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&mstp1_clks R8A7790_CLK_CMT0>;
clock-names = "fck";
+ };
- renesas,channels-mask = <0x60>;
+ cmt1: timer@e6130000 {
+ compatible = "renesas,cmt1-r8a7790", "renesas,rcar-gen2-cmt1";
+ reg = <0 0xe6130000 0 0x1004>;
+ interrupts = <0 120 IRQ_TYPE_LEVEL_HIGH>,
+ <0 121 IRQ_TYPE_LEVEL_HIGH>,
+ <0 122 IRQ_TYPE_LEVEL_HIGH>,
+ <0 123 IRQ_TYPE_LEVEL_HIGH>,
+ <0 124 IRQ_TYPE_LEVEL_HIGH>,
+ <0 125 IRQ_TYPE_LEVEL_HIGH>,
+ <0 126 IRQ_TYPE_LEVEL_HIGH>,
+ <0 127 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&mstp3_clks R8A7790_CLK_CMT1>;
+ clock-names = "fck";
};
diff --git a/Documentation/devicetree/bindings/usb/brcm,bdc.txt b/Documentation/devicetree/bindings/usb/brcm,bdc.txt
new file mode 100644
index 000000000000..63e63af3bf59
--- /dev/null
+++ b/Documentation/devicetree/bindings/usb/brcm,bdc.txt
@@ -0,0 +1,29 @@
+Broadcom USB Device Controller (BDC)
+====================================
+
+Required properties:
+
+- compatible: must be one of:
+ "brcm,bdc-v0.16"
+ "brcm,bdc"
+- reg: the base register address and length
+- interrupts: the interrupt line for this controller
+
+Optional properties:
+
+On Broadcom STB platforms, these properties are required:
+
+- phys: phandle to one or two USB PHY blocks
+ NOTE: Some SoC's have a single phy and some have
+ USB 2.0 and USB 3.0 phys
+- clocks: phandle to the functional clock of this block
+
+Example:
+
+ bdc@f0b02000 {
+ compatible = "brcm,bdc-v0.16";
+ reg = <0xf0b02000 0xfc4>;
+ interrupts = <0x0 0x60 0x0>;
+ phys = <&usbphy_0 0x0>;
+ clocks = <&sw_usbd>;
+ };
diff --git a/Documentation/devicetree/bindings/usb/fcs,fusb302.txt b/Documentation/devicetree/bindings/usb/fcs,fusb302.txt
new file mode 100644
index 000000000000..472facfa5a71
--- /dev/null
+++ b/Documentation/devicetree/bindings/usb/fcs,fusb302.txt
@@ -0,0 +1,29 @@
+Fairchild FUSB302 Type-C Port controllers
+
+Required properties :
+- compatible : "fcs,fusb302"
+- reg : I2C slave address
+- interrupts : Interrupt specifier
+
+Optional properties :
+- fcs,max-sink-microvolt : Maximum voltage to negotiate when configured as sink
+- fcs,max-sink-microamp : Maximum current to negotiate when configured as sink
+- fcs,max-sink-microwatt : Maximum power to negotiate when configured as sink
+ If this is less then max-sink-microvolt *
+ max-sink-microamp then the configured current will
+ be clamped.
+- fcs,operating-sink-microwatt :
+ Minimum amount of power accepted from a sink
+ when negotiating
+
+Example:
+
+fusb302: typec-portc@54 {
+ compatible = "fcs,fusb302";
+ reg = <0x54>;
+ interrupt-parent = <&nmi_intc>;
+ interrupts = <0 IRQ_TYPE_LEVEL_LOW>;
+ fcs,max-sink-microvolt = <12000000>;
+ fcs,max-sink-microamp = <3000000>;
+ fcs,max-sink-microwatt = <36000000>;
+};
diff --git a/Documentation/devicetree/bindings/usb/keystone-usb.txt b/Documentation/devicetree/bindings/usb/keystone-usb.txt
index 60527d335b58..2d1bef16f149 100644
--- a/Documentation/devicetree/bindings/usb/keystone-usb.txt
+++ b/Documentation/devicetree/bindings/usb/keystone-usb.txt
@@ -12,8 +12,21 @@ Required properties:
MPU.
- ranges: allows valid 1:1 translation between child's address space and
parent's address space.
- - clocks: Clock IDs array as required by the controller.
- - clock-names: names of clocks correseponding to IDs in the clock property.
+
+SoC-specific Required Properties:
+The following are mandatory properties for Keystone 2 66AK2HK, 66AK2L and 66AK2E
+SoCs only:
+
+- clocks: Clock ID for USB functional clock.
+- clock-names: Must be "usb".
+
+
+The following are mandatory properties for Keystone 2 66AK2G SoCs only:
+
+- power-domains: Should contain a phandle to a PM domain provider node
+ and an args specifier containing the USB device id
+ value. This property is as per the binding,
+ Documentation/devicetree/bindings/soc/ti/sci-pm-domain.txt
Sub-nodes:
The dwc3 core should be added as subnode to Keystone DWC3 glue.
diff --git a/Documentation/devicetree/bindings/usb/mt8173-xhci.txt b/Documentation/devicetree/bindings/usb/mediatek,mtk-xhci.txt
index 0acfc8acbea1..5611a2e4ddf0 100644
--- a/Documentation/devicetree/bindings/usb/mt8173-xhci.txt
+++ b/Documentation/devicetree/bindings/usb/mediatek,mtk-xhci.txt
@@ -11,7 +11,11 @@ into two parts.
------------------------------------------------------------------------
Required properties:
- - compatible : should contain "mediatek,mt8173-xhci"
+ - compatible : should be "mediatek,<soc-model>-xhci", "mediatek,mtk-xhci",
+ soc-model is the name of SoC, such as mt8173, mt2712 etc, when using
+ "mediatek,mtk-xhci" compatible string, you need SoC specific ones in
+ addition, one of:
+ - "mediatek,mt8173-xhci"
- reg : specifies physical base address and size of the registers
- reg-names: should be "mac" for xHCI MAC and "ippc" for IP port control
- interrupts : interrupt used by the controller
@@ -68,10 +72,14 @@ usb30: usb@11270000 {
In the case, xhci is added as subnode to mtu3. An example and the DT binding
details of mtu3 can be found in:
-Documentation/devicetree/bindings/usb/mt8173-mtu3.txt
+Documentation/devicetree/bindings/usb/mediatek,mtu3.txt
Required properties:
- - compatible : should contain "mediatek,mt8173-xhci"
+ - compatible : should be "mediatek,<soc-model>-xhci", "mediatek,mtk-xhci",
+ soc-model is the name of SoC, such as mt8173, mt2712 etc, when using
+ "mediatek,mtk-xhci" compatible string, you need SoC specific ones in
+ addition, one of:
+ - "mediatek,mt8173-xhci"
- reg : specifies physical base address and size of the registers
- reg-names: should be "mac" for xHCI MAC
- interrupts : interrupt used by the host controller
diff --git a/Documentation/devicetree/bindings/usb/mt8173-mtu3.txt b/Documentation/devicetree/bindings/usb/mediatek,mtu3.txt
index 1d7c3bc677f7..838ae48eafc1 100644
--- a/Documentation/devicetree/bindings/usb/mt8173-mtu3.txt
+++ b/Documentation/devicetree/bindings/usb/mediatek,mtu3.txt
@@ -1,7 +1,11 @@
The device node for Mediatek USB3.0 DRD controller
Required properties:
- - compatible : should be "mediatek,mt8173-mtu3"
+ - compatible : should be "mediatek,<soc-model>-mtu3", "mediatek,mtu3",
+ soc-model is the name of SoC, such as mt8173, mt2712 etc,
+ when using "mediatek,mtu3" compatible string, you need SoC specific
+ ones in addition, one of:
+ - "mediatek,mt8173-mtu3"
- reg : specifies physical base address and size of the registers
- reg-names: should be "mac" for device IP and "ippc" for IP port control
- interrupts : interrupt used by the device IP
@@ -44,7 +48,7 @@ Optional properties:
Sub-nodes:
The xhci should be added as subnode to mtu3 as shown in the following example
if host mode is enabled. The DT binding details of xhci can be found in:
-Documentation/devicetree/bindings/usb/mt8173-xhci.txt
+Documentation/devicetree/bindings/usb/mediatek,mtk-xhci.txt
Example:
ssusb: usb@11271000 {
diff --git a/Documentation/devicetree/bindings/usb/renesas_usb3.txt b/Documentation/devicetree/bindings/usb/renesas_usb3.txt
index 8d52766f07b9..e28025883b79 100644
--- a/Documentation/devicetree/bindings/usb/renesas_usb3.txt
+++ b/Documentation/devicetree/bindings/usb/renesas_usb3.txt
@@ -3,20 +3,30 @@ Renesas Electronics USB3.0 Peripheral driver
Required properties:
- compatible: Must contain one of the following:
- "renesas,r8a7795-usb3-peri"
+ - "renesas,r8a7796-usb3-peri"
+ - "renesas,rcar-gen3-usb3-peri" for a generic R-Car Gen3 compatible
+ device
+
+ When compatible with the generic version, nodes must list the
+ SoC-specific version corresponding to the platform first
+ followed by the generic version.
+
- reg: Base address and length of the register for the USB3.0 Peripheral
- interrupts: Interrupt specifier for the USB3.0 Peripheral
- clocks: clock phandle and specifier pair
-Example:
+Example of R-Car H3 ES1.x:
usb3_peri0: usb@ee020000 {
- compatible = "renesas,r8a7795-usb3-peri";
+ compatible = "renesas,r8a7795-usb3-peri",
+ "renesas,rcar-gen3-usb3-peri";
reg = <0 0xee020000 0 0x400>;
interrupts = <GIC_SPI 104 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&cpg CPG_MOD 328>;
};
usb3_peri1: usb@ee060000 {
- compatible = "renesas,r8a7795-usb3-peri";
+ compatible = "renesas,r8a7795-usb3-peri",
+ "renesas,rcar-gen3-usb3-peri";
reg = <0 0xee060000 0 0x400>;
interrupts = <GIC_SPI 100 IRQ_TYPE_LEVEL_HIGH>;
clocks = <&cpg CPG_MOD 327>;
diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt
index daf465bef758..4e72012928b4 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.txt
+++ b/Documentation/devicetree/bindings/vendor-prefixes.txt
@@ -175,6 +175,7 @@ kosagi Sutajio Ko-Usagi PTE Ltd.
kyo Kyocera Corporation
lacie LaCie
lantiq Lantiq Semiconductor
+lattice Lattice Semiconductor
lego LEGO Systems A/S
lenovo Lenovo Group Ltd.
lg LG Corporation
@@ -249,6 +250,7 @@ oxsemi Oxford Semiconductor, Ltd.
panasonic Panasonic Corporation
parade Parade Technologies Inc.
pericom Pericom Technology Inc.
+pervasive Pervasive Displays, Inc.
phytec PHYTEC Messtechnik GmbH
picochip Picochip Ltd
pine64 Pine64
diff --git a/Documentation/devicetree/bindings/xilinx.txt b/Documentation/devicetree/bindings/xilinx.txt
index 299d0923537b..1d11b9002ef8 100644
--- a/Documentation/devicetree/bindings/xilinx.txt
+++ b/Documentation/devicetree/bindings/xilinx.txt
@@ -281,6 +281,8 @@
capabilities of the underlying ICAP hardware
differ between different families. May be
'virtex2p', 'virtex4', or 'virtex5'.
+ - compatible : should contain "xlnx,xps-hwicap-1.00.a" or
+ "xlnx,opb-hwicap-1.00.b".
vi) Xilinx Uart 16550
diff --git a/Documentation/doc-guide/sphinx.rst b/Documentation/doc-guide/sphinx.rst
index 84e8e8a9cbdb..a2417633fdd8 100644
--- a/Documentation/doc-guide/sphinx.rst
+++ b/Documentation/doc-guide/sphinx.rst
@@ -19,6 +19,110 @@ Finally, there are thousands of plain text documentation files scattered around
``Documentation``. Some of these will likely be converted to reStructuredText
over time, but the bulk of them will remain in plain text.
+.. _sphinx_install:
+
+Sphinx Install
+==============
+
+The ReST markups currently used by the Documentation/ files are meant to be
+built with ``Sphinx`` version 1.3 or upper. If you're desiring to build
+PDF outputs, it is recommended to use version 1.4.6 or upper.
+
+There's a script that checks for the Spinx requirements. Please see
+:ref:`sphinx-pre-install` for further details.
+
+Most distributions are shipped with Sphinx, but its toolchain is fragile,
+and it is not uncommon that upgrading it or some other Python packages
+on your machine would cause the documentation build to break.
+
+A way to get rid of that is to use a different version than the one shipped
+on your distributions. In order to do that, it is recommended to install
+Sphinx inside a virtual environment, using ``virtualenv-3``
+or ``virtualenv``, depending on how your distribution packaged Python 3.
+
+.. note::
+
+ #) Sphinx versions below 1.5 don't work properly with Python's
+ docutils version 0.13.1 or upper. So, if you're willing to use
+ those versions, you should run ``pip install 'docutils==0.12'``.
+
+ #) It is recommended to use the RTD theme for html output. Depending
+ on the Sphinx version, it should be installed in separate,
+ with ``pip install sphinx_rtd_theme``.
+
+ #) Some ReST pages contain math expressions. Due to the way Sphinx work,
+ those expressions are written using LaTeX notation. It needs texlive
+ installed with amdfonts and amsmath in order to evaluate them.
+
+In summary, if you want to install Sphinx version 1.4.9, you should do::
+
+ $ virtualenv sphinx_1.4
+ $ . sphinx_1.4/bin/activate
+ (sphinx_1.4) $ pip install -r Documentation/sphinx/requirements.txt
+
+After running ``. sphinx_1.4/bin/activate``, the prompt will change,
+in order to indicate that you're using the new environment. If you
+open a new shell, you need to rerun this command to enter again at
+the virtual environment before building the documentation.
+
+Image output
+------------
+
+The kernel documentation build system contains an extension that
+handles images on both GraphViz and SVG formats (see
+:ref:`sphinx_kfigure`).
+
+For it to work, you need to install both GraphViz and ImageMagick
+packages. If those packages are not installed, the build system will
+still build the documentation, but won't include any images at the
+output.
+
+PDF and LaTeX builds
+--------------------
+
+Such builds are currently supported only with Sphinx versions 1.4 and upper.
+
+For PDF and LaTeX output, you'll also need ``XeLaTeX`` version 3.14159265.
+
+Depending on the distribution, you may also need to install a series of
+``texlive`` packages that provide the minimal set of functionalities
+required for ``XeLaTeX`` to work.
+
+.. _sphinx-pre-install:
+
+Checking for Sphinx dependencies
+--------------------------------
+
+There's a script that automatically check for Sphinx dependencies. If it can
+recognize your distribution, it will also give a hint about the install
+command line options for your distro::
+
+ $ ./scripts/sphinx-pre-install
+ Checking if the needed tools for Fedora release 26 (Twenty Six) are available
+ Warning: better to also install "texlive-luatex85".
+ You should run:
+
+ sudo dnf install -y texlive-luatex85
+ /usr/bin/virtualenv sphinx_1.4
+ . sphinx_1.4/bin/activate
+ pip install -r Documentation/sphinx/requirements.txt
+
+ Can't build as 1 mandatory dependency is missing at ./scripts/sphinx-pre-install line 468.
+
+By default, it checks all the requirements for both html and PDF, including
+the requirements for images, math expressions and LaTeX build, and assumes
+that a virtual Python environment will be used. The ones needed for html
+builds are assumed to be mandatory; the others to be optional.
+
+It supports two optional parameters:
+
+``--no-pdf``
+ Disable checks for PDF;
+
+``--no-virtualenv``
+ Use OS packaging for Sphinx instead of Python virtual environment.
+
+
Sphinx Build
============
@@ -118,7 +222,7 @@ Here are some specific guidelines for the kernel documentation:
the C domain
------------
-The `Sphinx C Domain`_ (name c) is suited for documentation of C API. E.g. a
+The **Sphinx C Domain** (name c) is suited for documentation of C API. E.g. a
function prototype:
.. code-block:: rst
@@ -229,6 +333,7 @@ Rendered as:
- column 3
+.. _sphinx_kfigure:
Figures & Images
================
diff --git a/Documentation/driver-api/basics.rst b/Documentation/driver-api/basics.rst
index ab82250c7727..73fa7d42bbba 100644
--- a/Documentation/driver-api/basics.rst
+++ b/Documentation/driver-api/basics.rst
@@ -4,7 +4,7 @@ Driver Basics
Driver Entry and Exit points
----------------------------
-.. kernel-doc:: include/linux/init.h
+.. kernel-doc:: include/linux/module.h
:internal:
Driver device table
@@ -103,9 +103,6 @@ Kernel utility functions
.. kernel-doc:: kernel/panic.c
:export:
-.. kernel-doc:: kernel/sys.c
- :export:
-
.. kernel-doc:: kernel/rcu/tree.c
:export:
diff --git a/Documentation/driver-api/dma-buf.rst b/Documentation/driver-api/dma-buf.rst
index 31671b469627..dc384f2f7f34 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -139,9 +139,6 @@ DMA Fences
Seqno Hardware Fences
~~~~~~~~~~~~~~~~~~~~~
-.. kernel-doc:: drivers/dma-buf/seqno-fence.c
- :export:
-
.. kernel-doc:: include/linux/seqno-fence.h
:internal:
diff --git a/Documentation/driver-api/gpio.rst b/Documentation/driver-api/gpio.rst
new file mode 100644
index 000000000000..6dd4aa647f27
--- /dev/null
+++ b/Documentation/driver-api/gpio.rst
@@ -0,0 +1,45 @@
+===================================
+General Purpose Input/Output (GPIO)
+===================================
+
+Core
+====
+
+.. kernel-doc:: include/linux/gpio/driver.h
+ :internal:
+
+.. kernel-doc:: drivers/gpio/gpiolib.c
+ :export:
+
+Legacy API
+==========
+
+The functions listed in this section are deprecated. The GPIO descriptor based
+API described above should be used in new code.
+
+.. kernel-doc:: drivers/gpio/gpiolib-legacy.c
+ :export:
+
+ACPI support
+============
+
+.. kernel-doc:: drivers/gpio/gpiolib-acpi.c
+ :export:
+
+Device tree support
+===================
+
+.. kernel-doc:: drivers/gpio/gpiolib-of.c
+ :export:
+
+Device-managed API
+==================
+
+.. kernel-doc:: drivers/gpio/devres.c
+ :export:
+
+sysfs helpers
+=============
+
+.. kernel-doc:: drivers/gpio/gpiolib-sysfs.c
+ :export:
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index 7c94ab50afed..9c20624842b7 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -44,6 +44,7 @@ available subsections can be seen below.
uio-howto
firmware/index
pinctl
+ gpio
misc_devices
.. only:: subproject and html
diff --git a/Documentation/driver-api/miscellaneous.rst b/Documentation/driver-api/miscellaneous.rst
index 8da7d115bafc..304ffb146cf9 100644
--- a/Documentation/driver-api/miscellaneous.rst
+++ b/Documentation/driver-api/miscellaneous.rst
@@ -47,4 +47,3 @@ used by one consumer at a time.
.. kernel-doc:: drivers/pwm/core.c
:export:
-
diff --git a/Documentation/driver-api/s390-drivers.rst b/Documentation/driver-api/s390-drivers.rst
index 7060da136095..ecf8851d3565 100644
--- a/Documentation/driver-api/s390-drivers.rst
+++ b/Documentation/driver-api/s390-drivers.rst
@@ -75,7 +75,7 @@ The channel-measurement facility provides a means to collect measurement
data which is made available by the channel subsystem for each channel
attached device.
-.. kernel-doc:: arch/s390/include/asm/cmb.h
+.. kernel-doc:: arch/s390/include/uapi/asm/cmb.h
:internal:
.. kernel-doc:: drivers/s390/cio/cmf.c
diff --git a/Documentation/driver-api/scsi.rst b/Documentation/driver-api/scsi.rst
index 859fb672319f..5a2aa7a377d9 100644
--- a/Documentation/driver-api/scsi.rst
+++ b/Documentation/driver-api/scsi.rst
@@ -224,14 +224,6 @@ mid to lowlevel SCSI driver interface
.. kernel-doc:: drivers/scsi/hosts.c
:export:
-drivers/scsi/constants.c
-~~~~~~~~~~~~~~~~~~~~~~~~
-
-mid to lowlevel SCSI driver interface
-
-.. kernel-doc:: drivers/scsi/constants.c
- :export:
-
Transport classes
-----------------
diff --git a/Documentation/driver-model/devres.txt b/Documentation/driver-model/devres.txt
index 30e04f7a690d..69f08c0f23a8 100644
--- a/Documentation/driver-model/devres.txt
+++ b/Documentation/driver-model/devres.txt
@@ -312,6 +312,7 @@ IRQ
devm_irq_alloc_descs_from()
devm_irq_alloc_generic_chip()
devm_irq_setup_generic_chip()
+ devm_irq_sim_init()
LED
devm_led_classdev_register()
diff --git a/Documentation/errseq.rst b/Documentation/errseq.rst
new file mode 100644
index 000000000000..4c29bd5afbc5
--- /dev/null
+++ b/Documentation/errseq.rst
@@ -0,0 +1,149 @@
+The errseq_t datatype
+=====================
+An errseq_t is a way of recording errors in one place, and allowing any
+number of "subscribers" to tell whether it has changed since a previous
+point where it was sampled.
+
+The initial use case for this is tracking errors for file
+synchronization syscalls (fsync, fdatasync, msync and sync_file_range),
+but it may be usable in other situations.
+
+It's implemented as an unsigned 32-bit value. The low order bits are
+designated to hold an error code (between 1 and MAX_ERRNO). The upper bits
+are used as a counter. This is done with atomics instead of locking so that
+these functions can be called from any context.
+
+Note that there is a risk of collisions if new errors are being recorded
+frequently, since we have so few bits to use as a counter.
+
+To mitigate this, the bit between the error value and counter is used as
+a flag to tell whether the value has been sampled since a new value was
+recorded. That allows us to avoid bumping the counter if no one has
+sampled it since the last time an error was recorded.
+
+Thus we end up with a value that looks something like this::
+
+ bit: 31..13 12 11..0
+ +-----------------+----+----------------+
+ | counter | SF | errno |
+ +-----------------+----+----------------+
+
+The general idea is for "watchers" to sample an errseq_t value and keep
+it as a running cursor. That value can later be used to tell whether
+any new errors have occurred since that sampling was done, and atomically
+record the state at the time that it was checked. This allows us to
+record errors in one place, and then have a number of "watchers" that
+can tell whether the value has changed since they last checked it.
+
+A new errseq_t should always be zeroed out. An errseq_t value of all zeroes
+is the special (but common) case where there has never been an error. An all
+zero value thus serves as the "epoch" if one wishes to know whether there
+has ever been an error set since it was first initialized.
+
+API usage
+=========
+Let me tell you a story about a worker drone. Now, he's a good worker
+overall, but the company is a little...management heavy. He has to
+report to 77 supervisors today, and tomorrow the "big boss" is coming in
+from out of town and he's sure to test the poor fellow too.
+
+They're all handing him work to do -- so much he can't keep track of who
+handed him what, but that's not really a big problem. The supervisors
+just want to know when he's finished all of the work they've handed him so
+far and whether he made any mistakes since they last asked.
+
+He might have made the mistake on work they didn't actually hand him,
+but he can't keep track of things at that level of detail, all he can
+remember is the most recent mistake that he made.
+
+Here's our worker_drone representation::
+
+ struct worker_drone {
+ errseq_t wd_err; /* for recording errors */
+ };
+
+Every day, the worker_drone starts out with a blank slate::
+
+ struct worker_drone wd;
+
+ wd.wd_err = (errseq_t)0;
+
+The supervisors come in and get an initial read for the day. They
+don't care about anything that happened before their watch begins::
+
+ struct supervisor {
+ errseq_t s_wd_err; /* private "cursor" for wd_err */
+ spinlock_t s_wd_err_lock; /* protects s_wd_err */
+ }
+
+ struct supervisor su;
+
+ su.s_wd_err = errseq_sample(&wd.wd_err);
+ spin_lock_init(&su.s_wd_err_lock);
+
+Now they start handing him tasks to do. Every few minutes they ask him to
+finish up all of the work they've handed him so far. Then they ask him
+whether he made any mistakes on any of it::
+
+ spin_lock(&su.su_wd_err_lock);
+ err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err);
+ spin_unlock(&su.su_wd_err_lock);
+
+Up to this point, that just keeps returning 0.
+
+Now, the owners of this company are quite miserly and have given him
+substandard equipment with which to do his job. Occasionally it
+glitches and he makes a mistake. He sighs a heavy sigh, and marks it
+down::
+
+ errseq_set(&wd.wd_err, -EIO);
+
+...and then gets back to work. The supervisors eventually poll again
+and they each get the error when they next check. Subsequent calls will
+return 0, until another error is recorded, at which point it's reported
+to each of them once.
+
+Note that the supervisors can't tell how many mistakes he made, only
+whether one was made since they last checked, and the latest value
+recorded.
+
+Occasionally the big boss comes in for a spot check and asks the worker
+to do a one-off job for him. He's not really watching the worker
+full-time like the supervisors, but he does need to know whether a
+mistake occurred while his job was processing.
+
+He can just sample the current errseq_t in the worker, and then use that
+to tell whether an error has occurred later::
+
+ errseq_t since = errseq_sample(&wd.wd_err);
+ /* submit some work and wait for it to complete */
+ err = errseq_check(&wd.wd_err, since);
+
+Since he's just going to discard "since" after that point, he doesn't
+need to advance it here. He also doesn't need any locking since it's
+not usable by anyone else.
+
+Serializing errseq_t cursor updates
+===================================
+Note that the errseq_t API does not protect the errseq_t cursor during a
+check_and_advance_operation. Only the canonical error code is handled
+atomically. In a situation where more than one task might be using the
+same errseq_t cursor at the same time, it's important to serialize
+updates to that cursor.
+
+If that's not done, then it's possible for the cursor to go backward
+in which case the same error could be reported more than once.
+
+Because of this, it's often advantageous to first do an errseq_check to
+see if anything has changed, and only later do an
+errseq_check_and_advance after taking the lock. e.g.::
+
+ if (errseq_check(&wd.wd_err, READ_ONCE(su.s_wd_err)) {
+ /* su.s_wd_err is protected by s_wd_err_lock */
+ spin_lock(&su.s_wd_err_lock);
+ err = errseq_check_and_advance(&wd.wd_err, &su.s_wd_err);
+ spin_unlock(&su.s_wd_err_lock);
+ }
+
+That avoids the spinlock in the common case where nothing has changed
+since the last time it was checked.
diff --git a/Documentation/features/core/tracehook/arch-support.txt b/Documentation/features/core/tracehook/arch-support.txt
index 5e97a89420ef..dfb638c2f842 100644
--- a/Documentation/features/core/tracehook/arch-support.txt
+++ b/Documentation/features/core/tracehook/arch-support.txt
@@ -25,7 +25,7 @@
| mn10300: | ok |
| nios2: | ok |
| openrisc: | ok |
- | parisc: | TODO |
+ | parisc: | ok |
| powerpc: | ok |
| s390: | ok |
| score: | TODO |
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt
index aed6b94160b1..0eb31de3a2c1 100644
--- a/Documentation/filesystems/caching/netfs-api.txt
+++ b/Documentation/filesystems/caching/netfs-api.txt
@@ -151,8 +151,6 @@ To define an object, a structure of the following type should be filled out:
void (*mark_pages_cached)(void *cookie_netfs_data,
struct address_space *mapping,
struct pagevec *cached_pvec);
-
- void (*now_uncached)(void *cookie_netfs_data);
};
This has the following fields:
diff --git a/Documentation/filesystems/dax.txt b/Documentation/filesystems/dax.txt
index a7e6e14aeb08..3be3b266be41 100644
--- a/Documentation/filesystems/dax.txt
+++ b/Documentation/filesystems/dax.txt
@@ -63,9 +63,8 @@ Filesystem support consists of
- implementing an mmap file operation for DAX files which sets the
VM_MIXEDMAP and VM_HUGEPAGE flags on the VMA, and setting the vm_ops to
include handlers for fault, pmd_fault, page_mkwrite, pfn_mkwrite. These
- handlers should probably call dax_iomap_fault() (for fault and page_mkwrite
- handlers), dax_iomap_pmd_fault(), dax_pfn_mkwrite() passing the appropriate
- iomap operations.
+ handlers should probably call dax_iomap_fault() passing the appropriate
+ fault size and iomap operations.
- calling iomap_zero_range() passing appropriate iomap operations instead of
block_truncate_page() for DAX files
- ensuring that there is sufficient locking between reads, writes,
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 73e7d91f03dc..405a3df759b3 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -829,9 +829,7 @@ struct address_space_operations {
swap_activate: Called when swapon is used on a file to allocate
space if necessary and pin the block lookup information in
memory. A return value of zero indicates success,
- in which case this file can be used to back swapspace. The
- swapspace operations will be proxied to this address space's
- ->swap_{out,in} methods.
+ in which case this file can be used to back swapspace.
swap_deactivate: Called during swapoff on files where swap_activate
was successful.
diff --git a/Documentation/gpu/drm-internals.rst b/Documentation/gpu/drm-internals.rst
index 0d936c67bf7d..5ee9674fb9e9 100644
--- a/Documentation/gpu/drm-internals.rst
+++ b/Documentation/gpu/drm-internals.rst
@@ -201,6 +201,8 @@ drivers.
Open/Close, File Operations and IOCTLs
======================================
+.. _drm_driver_fops:
+
File Operations
---------------
diff --git a/Documentation/gpu/drm-kms-helpers.rst b/Documentation/gpu/drm-kms-helpers.rst
index 7c5e2549a58a..13dd237418cc 100644
--- a/Documentation/gpu/drm-kms-helpers.rst
+++ b/Documentation/gpu/drm-kms-helpers.rst
@@ -296,3 +296,12 @@ Auxiliary Modeset Helpers
.. kernel-doc:: drivers/gpu/drm/drm_modeset_helper.c
:export:
+
+Framebuffer GEM Helper Reference
+================================
+
+.. kernel-doc:: drivers/gpu/drm/drm_gem_framebuffer_helper.c
+ :doc: overview
+
+.. kernel-doc:: drivers/gpu/drm/drm_gem_framebuffer_helper.c
+ :export:
diff --git a/Documentation/gpu/drm-kms.rst b/Documentation/gpu/drm-kms.rst
index 2d77c9580164..307284125d7a 100644
--- a/Documentation/gpu/drm-kms.rst
+++ b/Documentation/gpu/drm-kms.rst
@@ -523,9 +523,6 @@ Color Management Properties
.. kernel-doc:: drivers/gpu/drm/drm_color_mgmt.c
:doc: overview
-.. kernel-doc:: include/drm/drm_color_mgmt.h
- :internal:
-
.. kernel-doc:: drivers/gpu/drm/drm_color_mgmt.c
:export:
@@ -554,60 +551,8 @@ various modules/drivers.
Vertical Blanking
=================
-Vertical blanking plays a major role in graphics rendering. To achieve
-tear-free display, users must synchronize page flips and/or rendering to
-vertical blanking. The DRM API offers ioctls to perform page flips
-synchronized to vertical blanking and wait for vertical blanking.
-
-The DRM core handles most of the vertical blanking management logic,
-which involves filtering out spurious interrupts, keeping race-free
-blanking counters, coping with counter wrap-around and resets and
-keeping use counts. It relies on the driver to generate vertical
-blanking interrupts and optionally provide a hardware vertical blanking
-counter. Drivers must implement the following operations.
-
-- int (\*enable_vblank) (struct drm_device \*dev, int crtc); void
- (\*disable_vblank) (struct drm_device \*dev, int crtc);
- Enable or disable vertical blanking interrupts for the given CRTC.
-
-- u32 (\*get_vblank_counter) (struct drm_device \*dev, int crtc);
- Retrieve the value of the vertical blanking counter for the given
- CRTC. If the hardware maintains a vertical blanking counter its value
- should be returned. Otherwise drivers can use the
- :c:func:`drm_vblank_count()` helper function to handle this
- operation.
-
-Drivers must initialize the vertical blanking handling core with a call
-to :c:func:`drm_vblank_init()` in their load operation.
-
-Vertical blanking interrupts can be enabled by the DRM core or by
-drivers themselves (for instance to handle page flipping operations).
-The DRM core maintains a vertical blanking use count to ensure that the
-interrupts are not disabled while a user still needs them. To increment
-the use count, drivers call :c:func:`drm_vblank_get()`. Upon
-return vertical blanking interrupts are guaranteed to be enabled.
-
-To decrement the use count drivers call
-:c:func:`drm_vblank_put()`. Only when the use count drops to zero
-will the DRM core disable the vertical blanking interrupts after a delay
-by scheduling a timer. The delay is accessible through the
-vblankoffdelay module parameter or the ``drm_vblank_offdelay`` global
-variable and expressed in milliseconds. Its default value is 5000 ms.
-Zero means never disable, and a negative value means disable
-immediately. Drivers may override the behaviour by setting the
-:c:type:`struct drm_device <drm_device>`
-vblank_disable_immediate flag, which when set causes vblank interrupts
-to be disabled immediately regardless of the drm_vblank_offdelay
-value. The flag should only be set if there's a properly working
-hardware vblank counter present.
-
-When a vertical blanking interrupt occurs drivers only need to call the
-:c:func:`drm_handle_vblank()` function to account for the
-interrupt.
-
-Resources allocated by :c:func:`drm_vblank_init()` must be freed
-with a call to :c:func:`drm_vblank_cleanup()` in the driver unload
-operation handler.
+.. kernel-doc:: drivers/gpu/drm/drm_vblank.c
+ :doc: vblank handling
Vertical Blanking and Interrupt Handling Functions Reference
------------------------------------------------------------
diff --git a/Documentation/gpu/drm-mm.rst b/Documentation/gpu/drm-mm.rst
index 9412798645c1..b08e9dcd9177 100644
--- a/Documentation/gpu/drm-mm.rst
+++ b/Documentation/gpu/drm-mm.rst
@@ -191,7 +191,7 @@ acquired and release by :c:func:`calling drm_gem_object_get()` and
holding the lock.
When the last reference to a GEM object is released the GEM core calls
-the :c:type:`struct drm_driver <drm_driver>` gem_free_object
+the :c:type:`struct drm_driver <drm_driver>` gem_free_object_unlocked
operation. That operation is mandatory for GEM-enabled drivers and must
free the GEM object and all associated resources.
@@ -492,7 +492,7 @@ DRM Sync Objects
:doc: Overview
.. kernel-doc:: include/drm/drm_syncobj.h
- :export:
+ :internal:
.. kernel-doc:: drivers/gpu/drm/drm_syncobj.c
:export:
diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
index 858457567d3d..679373b4a03f 100644
--- a/Documentation/gpu/drm-uapi.rst
+++ b/Documentation/gpu/drm-uapi.rst
@@ -160,6 +160,8 @@ other hand, a driver requires shared state between clients which is
visible to user-space and accessible beyond open-file boundaries, they
cannot support render nodes.
+.. _drm_driver_ioctl:
+
IOCTL Support on Device Nodes
=============================
diff --git a/Documentation/gpu/i915.rst b/Documentation/gpu/i915.rst
index 9c7ed3e3f1e9..2e7ee0313c1c 100644
--- a/Documentation/gpu/i915.rst
+++ b/Documentation/gpu/i915.rst
@@ -417,6 +417,10 @@ integrate with drm/i915 and to handle the `DRM_I915_PERF_OPEN` ioctl.
:functions: i915_perf_open_ioctl
.. kernel-doc:: drivers/gpu/drm/i915/i915_perf.c
:functions: i915_perf_release
+.. kernel-doc:: drivers/gpu/drm/i915/i915_perf.c
+ :functions: i915_perf_add_config_ioctl
+.. kernel-doc:: drivers/gpu/drm/i915/i915_perf.c
+ :functions: i915_perf_remove_config_ioctl
i915 Perf Stream
----------------
@@ -477,4 +481,16 @@ specific details than found in the more high-level sections.
.. kernel-doc:: drivers/gpu/drm/i915/i915_perf.c
:internal:
-.. WARNING: DOCPROC directive not supported: !Cdrivers/gpu/drm/i915/i915_irq.c
+Style
+=====
+
+The drm/i915 driver codebase has some style rules in addition to (and, in some
+cases, deviating from) the kernel coding style.
+
+Register macro definition style
+-------------------------------
+
+The style guide for ``i915_reg.h``.
+
+.. kernel-doc:: drivers/gpu/drm/i915/i915_reg.h
+ :doc: The i915 register macro definition style guide
diff --git a/Documentation/gpu/todo.rst b/Documentation/gpu/todo.rst
index 1ae42006deea..22af55d06ab8 100644
--- a/Documentation/gpu/todo.rst
+++ b/Documentation/gpu/todo.rst
@@ -108,8 +108,8 @@ This would be especially useful for tinydrm:
crtc state, clear that to the max values, x/y = 0 and w/h = MAX_INT, in
__drm_atomic_helper_crtc_duplicate_state().
-- Move tinydrm_merge_clips into drm_framebuffer.c, dropping the tinydrm_
- prefix ofc and using drm_fb_. drm_framebuffer.c makes sense since this
+- Move tinydrm_merge_clips into drm_framebuffer.c, dropping the tinydrm\_
+ prefix ofc and using drm_fb\_. drm_framebuffer.c makes sense since this
is a function useful to implement the fb->dirty function.
- Create a new drm_fb_dirty function which does essentially what e.g.
diff --git a/Documentation/hwmon/ftsteutates b/Documentation/hwmon/ftsteutates
index 8c10a916de20..af54db92391b 100644
--- a/Documentation/hwmon/ftsteutates
+++ b/Documentation/hwmon/ftsteutates
@@ -18,6 +18,10 @@ enhancements. It can monitor up to 4 voltages, 16 temperatures and
8 fans. It also contains an integrated watchdog which is currently
implemented in this driver.
+To clear a temperature or fan alarm, execute the following command with the
+correct path to the alarm file:
+ echo 0 >XXXX_alarm
+
Specification of the chip can be found here:
ftp://ftp.ts.fujitsu.com/pub/Mainboard-OEM-Sales/Services/Software&Tools/Linux_SystemMonitoring&Watchdog&GPIO/BMC-Teutates_Specification_V1.21.pdf
ftp://ftp.ts.fujitsu.com/pub/Mainboard-OEM-Sales/Services/Software&Tools/Linux_SystemMonitoring&Watchdog&GPIO/Fujitsu_mainboards-1-Sensors_HowTo-en-US.pdf
diff --git a/Documentation/hwmon/ibm-cffps b/Documentation/hwmon/ibm-cffps
new file mode 100644
index 000000000000..e05ecd8ecfcf
--- /dev/null
+++ b/Documentation/hwmon/ibm-cffps
@@ -0,0 +1,54 @@
+Kernel driver ibm-cffps
+=======================
+
+Supported chips:
+ * IBM Common Form Factor power supply
+
+Author: Eddie James <eajames@us.ibm.com>
+
+Description
+-----------
+
+This driver supports IBM Common Form Factor (CFF) power supplies. This driver
+is a client to the core PMBus driver.
+
+Usage Notes
+-----------
+
+This driver does not auto-detect devices. You will have to instantiate the
+devices explicitly. Please see Documentation/i2c/instantiating-devices for
+details.
+
+Sysfs entries
+-------------
+
+The following attributes are supported:
+
+curr1_alarm Output current over-current alarm.
+curr1_input Measured output current in mA.
+curr1_label "iout1"
+
+fan1_alarm Fan 1 warning.
+fan1_fault Fan 1 fault.
+fan1_input Fan 1 speed in RPM.
+fan2_alarm Fan 2 warning.
+fan2_fault Fan 2 fault.
+fan2_input Fan 2 speed in RPM.
+
+in1_alarm Input voltage under-voltage alarm.
+in1_input Measured input voltage in mV.
+in1_label "vin"
+in2_alarm Output voltage over-voltage alarm.
+in2_input Measured output voltage in mV.
+in2_label "vout1"
+
+power1_alarm Input fault or alarm.
+power1_input Measured input power in uW.
+power1_label "pin"
+
+temp1_alarm PSU inlet ambient temperature over-temperature alarm.
+temp1_input Measured PSU inlet ambient temp in millidegrees C.
+temp2_alarm Secondary rectifier temp over-temperature alarm.
+temp2_input Measured secondary rectifier temp in millidegrees C.
+temp3_alarm ORing FET temperature over-temperature alarm.
+temp3_input Measured ORing FET temperature in millidegrees C.
diff --git a/Documentation/hwmon/lm25066 b/Documentation/hwmon/lm25066
index 2cb20ebb234d..3fa6bf820c88 100644
--- a/Documentation/hwmon/lm25066
+++ b/Documentation/hwmon/lm25066
@@ -29,6 +29,11 @@ Supported chips:
Addresses scanned: -
Datasheet:
http://www.national.com/pf/LM/LM5066.html
+ * Texas Instruments LM5066I
+ Prefix: 'lm5066i'
+ Addresses scanned: -
+ Datasheet:
+ http://www.ti.com/product/LM5066I
Author: Guenter Roeck <linux@roeck-us.net>
@@ -37,8 +42,8 @@ Description
-----------
This driver supports hardware monitoring for National Semiconductor / TI LM25056,
-LM25063, LM25066, LM5064, and LM5066 Power Management, Monitoring, Control, and
-Protection ICs.
+LM25063, LM25066, LM5064, and LM5066/LM5066I Power Management, Monitoring,
+Control, and Protection ICs.
The driver is a client driver to the core PMBus driver. Please see
Documentation/hwmon/pmbus for details on PMBus client drivers.
diff --git a/Documentation/iio/ep93xx_adc.txt b/Documentation/iio/ep93xx_adc.txt
new file mode 100644
index 000000000000..23053e7817bd
--- /dev/null
+++ b/Documentation/iio/ep93xx_adc.txt
@@ -0,0 +1,29 @@
+Cirrus Logic EP93xx ADC driver.
+
+1. Overview
+
+The driver is intended to work on both low-end (EP9301, EP9302) devices with
+5-channel ADC and high-end (EP9307, EP9312, EP9315) devices with 10-channel
+touchscreen/ADC module.
+
+2. Channel numbering
+
+Numbering scheme for channels 0..4 is defined in EP9301 and EP9302 datasheets.
+EP9307, EP9312 and EP9312 have 3 channels more (total 8), but the numbering is
+not defined. So the last three are numbered randomly, let's say.
+
+Assuming ep93xx_adc is IIO device0, you'd find the following entries under
+/sys/bus/iio/devices/iio:device0/:
+
+ +-----------------+---------------+
+ | sysfs entry | ball/pin name |
+ +-----------------+---------------+
+ | in_voltage0_raw | YM |
+ | in_voltage1_raw | SXP |
+ | in_voltage2_raw | SXM |
+ | in_voltage3_raw | SYP |
+ | in_voltage4_raw | SYM |
+ | in_voltage5_raw | XP |
+ | in_voltage6_raw | XM |
+ | in_voltage7_raw | YP |
+ +-----------------+---------------+
diff --git a/Documentation/infiniband/tag_matching.txt b/Documentation/infiniband/tag_matching.txt
new file mode 100644
index 000000000000..d2a3bf819226
--- /dev/null
+++ b/Documentation/infiniband/tag_matching.txt
@@ -0,0 +1,64 @@
+Tag matching logic
+
+The MPI standard defines a set of rules, known as tag-matching, for matching
+source send operations to destination receives. The following parameters must
+match the following source and destination parameters:
+* Communicator
+* User tag - wild card may be specified by the receiver
+* Source rank – wild car may be specified by the receiver
+* Destination rank – wild
+The ordering rules require that when more than one pair of send and receive
+message envelopes may match, the pair that includes the earliest posted-send
+and the earliest posted-receive is the pair that must be used to satisfy the
+matching operation. However, this doesn’t imply that tags are consumed in
+the order they are created, e.g., a later generated tag may be consumed, if
+earlier tags can’t be used to satisfy the matching rules.
+
+When a message is sent from the sender to the receiver, the communication
+library may attempt to process the operation either after or before the
+corresponding matching receive is posted. If a matching receive is posted,
+this is an expected message, otherwise it is called an unexpected message.
+Implementations frequently use different matching schemes for these two
+different matching instances.
+
+To keep MPI library memory footprint down, MPI implementations typically use
+two different protocols for this purpose:
+
+1. The Eager protocol- the complete message is sent when the send is
+processed by the sender. A completion send is received in the send_cq
+notifying that the buffer can be reused.
+
+2. The Rendezvous Protocol - the sender sends the tag-matching header,
+and perhaps a portion of data when first notifying the receiver. When the
+corresponding buffer is posted, the responder will use the information from
+the header to initiate an RDMA READ operation directly to the matching buffer.
+A fin message needs to be received in order for the buffer to be reused.
+
+Tag matching implementation
+
+There are two types of matching objects used, the posted receive list and the
+unexpected message list. The application posts receive buffers through calls
+to the MPI receive routines in the posted receive list and posts send messages
+using the MPI send routines. The head of the posted receive list may be
+maintained by the hardware, with the software expected to shadow this list.
+
+When send is initiated and arrives at the receive side, if there is no
+pre-posted receive for this arriving message, it is passed to the software and
+placed in the unexpected message list. Otherwise the match is processed,
+including rendezvous processing, if appropriate, delivering the data to the
+specified receive buffer. This allows overlapping receive-side MPI tag
+matching with computation.
+
+When a receive-message is posted, the communication library will first check
+the software unexpected message list for a matching receive. If a match is
+found, data is delivered to the user buffer, using a software controlled
+protocol. The UCX implementation uses either an eager or rendezvous protocol,
+depending on data size. If no match is found, the entire pre-posted receive
+list is maintained by the hardware, and there is space to add one more
+pre-posted receive to this list, this receive is passed to the hardware.
+Software is expected to shadow this list, to help with processing MPI cancel
+operations. In addition, because hardware and software are not expected to be
+tightly synchronized with respect to the tag-matching operation, this shadow
+list is used to detect the case that a pre-posted receive is passed to the
+hardware, as the matching unexpected message is being passed from the hardware
+to the software.
diff --git a/Documentation/input/input.rst b/Documentation/input/input.rst
index 3b3a22975106..47f86a4bf16c 100644
--- a/Documentation/input/input.rst
+++ b/Documentation/input/input.rst
@@ -109,7 +109,7 @@ evdev nodes are created with minors starting with 256.
keyboard
~~~~~~~~
-``keyboard`` is in-kernel input handler ad is a part of VT code. It
+``keyboard`` is in-kernel input handler and is a part of VT code. It
consumes keyboard keystrokes and handles user input for VT consoles.
mousedev
diff --git a/Documentation/input/joydev/index.rst b/Documentation/input/joydev/index.rst
index 8d9666c7561c..ebcff43056e2 100644
--- a/Documentation/input/joydev/index.rst
+++ b/Documentation/input/joydev/index.rst
@@ -12,7 +12,6 @@ Linux Joystick support
.. toctree::
:maxdepth: 3
- :numbered:
joystick
joystick-api
diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index 7003141a6d4f..329e740adea7 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
@@ -297,9 +297,9 @@ more details, with real examples.
ccflags-y specifies options for compiling with $(CC).
Example:
- # drivers/acpi/Makefile
- ccflags-y := -Os
- ccflags-$(CONFIG_ACPI_DEBUG) += -DACPI_DEBUG_OUTPUT
+ # drivers/acpi/acpica/Makefile
+ ccflags-y := -Os -D_LINUX -DBUILDING_ACPICA
+ ccflags-$(CONFIG_ACPI_DEBUG) += -DACPI_DEBUG_OUTPUT
This variable is necessary because the top Makefile owns the
variable $(KBUILD_CFLAGS) and uses it for compilation flags for the
diff --git a/Documentation/locking/crossrelease.txt b/Documentation/locking/crossrelease.txt
new file mode 100644
index 000000000000..bdf1423d5f99
--- /dev/null
+++ b/Documentation/locking/crossrelease.txt
@@ -0,0 +1,874 @@
+Crossrelease
+============
+
+Started by Byungchul Park <byungchul.park@lge.com>
+
+Contents:
+
+ (*) Background
+
+ - What causes deadlock
+ - How lockdep works
+
+ (*) Limitation
+
+ - Limit lockdep
+ - Pros from the limitation
+ - Cons from the limitation
+ - Relax the limitation
+
+ (*) Crossrelease
+
+ - Introduce crossrelease
+ - Introduce commit
+
+ (*) Implementation
+
+ - Data structures
+ - How crossrelease works
+
+ (*) Optimizations
+
+ - Avoid duplication
+ - Lockless for hot paths
+
+ (*) APPENDIX A: What lockdep does to work aggresively
+
+ (*) APPENDIX B: How to avoid adding false dependencies
+
+
+==========
+Background
+==========
+
+What causes deadlock
+--------------------
+
+A deadlock occurs when a context is waiting for an event to happen,
+which is impossible because another (or the) context who can trigger the
+event is also waiting for another (or the) event to happen, which is
+also impossible due to the same reason.
+
+For example:
+
+ A context going to trigger event C is waiting for event A to happen.
+ A context going to trigger event A is waiting for event B to happen.
+ A context going to trigger event B is waiting for event C to happen.
+
+A deadlock occurs when these three wait operations run at the same time,
+because event C cannot be triggered if event A does not happen, which in
+turn cannot be triggered if event B does not happen, which in turn
+cannot be triggered if event C does not happen. After all, no event can
+be triggered since any of them never meets its condition to wake up.
+
+A dependency might exist between two waiters and a deadlock might happen
+due to an incorrect releationship between dependencies. Thus, we must
+define what a dependency is first. A dependency exists between them if:
+
+ 1. There are two waiters waiting for each event at a given time.
+ 2. The only way to wake up each waiter is to trigger its event.
+ 3. Whether one can be woken up depends on whether the other can.
+
+Each wait in the example creates its dependency like:
+
+ Event C depends on event A.
+ Event A depends on event B.
+ Event B depends on event C.
+
+ NOTE: Precisely speaking, a dependency is one between whether a
+ waiter for an event can be woken up and whether another waiter for
+ another event can be woken up. However from now on, we will describe
+ a dependency as if it's one between an event and another event for
+ simplicity.
+
+And they form circular dependencies like:
+
+ -> C -> A -> B -
+ / \
+ \ /
+ ----------------
+
+ where 'A -> B' means that event A depends on event B.
+
+Such circular dependencies lead to a deadlock since no waiter can meet
+its condition to wake up as described.
+
+CONCLUSION
+
+Circular dependencies cause a deadlock.
+
+
+How lockdep works
+-----------------
+
+Lockdep tries to detect a deadlock by checking dependencies created by
+lock operations, acquire and release. Waiting for a lock corresponds to
+waiting for an event, and releasing a lock corresponds to triggering an
+event in the previous section.
+
+In short, lockdep does:
+
+ 1. Detect a new dependency.
+ 2. Add the dependency into a global graph.
+ 3. Check if that makes dependencies circular.
+ 4. Report a deadlock or its possibility if so.
+
+For example, consider a graph built by lockdep that looks like:
+
+ A -> B -
+ \
+ -> E
+ /
+ C -> D -
+
+ where A, B,..., E are different lock classes.
+
+Lockdep will add a dependency into the graph on detection of a new
+dependency. For example, it will add a dependency 'E -> C' when a new
+dependency between lock E and lock C is detected. Then the graph will be:
+
+ A -> B -
+ \
+ -> E -
+ / \
+ -> C -> D - \
+ / /
+ \ /
+ ------------------
+
+ where A, B,..., E are different lock classes.
+
+This graph contains a subgraph which demonstrates circular dependencies:
+
+ -> E -
+ / \
+ -> C -> D - \
+ / /
+ \ /
+ ------------------
+
+ where C, D and E are different lock classes.
+
+This is the condition under which a deadlock might occur. Lockdep
+reports it on detection after adding a new dependency. This is the way
+how lockdep works.
+
+CONCLUSION
+
+Lockdep detects a deadlock or its possibility by checking if circular
+dependencies were created after adding each new dependency.
+
+
+==========
+Limitation
+==========
+
+Limit lockdep
+-------------
+
+Limiting lockdep to work on only typical locks e.g. spin locks and
+mutexes, which are released within the acquire context, the
+implementation becomes simple but its capacity for detection becomes
+limited. Let's check pros and cons in next section.
+
+
+Pros from the limitation
+------------------------
+
+Given the limitation, when acquiring a lock, locks in a held_locks
+cannot be released if the context cannot acquire it so has to wait to
+acquire it, which means all waiters for the locks in the held_locks are
+stuck. It's an exact case to create dependencies between each lock in
+the held_locks and the lock to acquire.
+
+For example:
+
+ CONTEXT X
+ ---------
+ acquire A
+ acquire B /* Add a dependency 'A -> B' */
+ release B
+ release A
+
+ where A and B are different lock classes.
+
+When acquiring lock A, the held_locks of CONTEXT X is empty thus no
+dependency is added. But when acquiring lock B, lockdep detects and adds
+a new dependency 'A -> B' between lock A in the held_locks and lock B.
+They can be simply added whenever acquiring each lock.
+
+And data required by lockdep exists in a local structure, held_locks
+embedded in task_struct. Forcing to access the data within the context,
+lockdep can avoid racy problems without explicit locks while handling
+the local data.
+
+Lastly, lockdep only needs to keep locks currently being held, to build
+a dependency graph. However, relaxing the limitation, it needs to keep
+even locks already released, because a decision whether they created
+dependencies might be long-deferred.
+
+To sum up, we can expect several advantages from the limitation:
+
+ 1. Lockdep can easily identify a dependency when acquiring a lock.
+ 2. Races are avoidable while accessing local locks in a held_locks.
+ 3. Lockdep only needs to keep locks currently being held.
+
+CONCLUSION
+
+Given the limitation, the implementation becomes simple and efficient.
+
+
+Cons from the limitation
+------------------------
+
+Given the limitation, lockdep is applicable only to typical locks. For
+example, page locks for page access or completions for synchronization
+cannot work with lockdep.
+
+Can we detect deadlocks below, under the limitation?
+
+Example 1:
+
+ CONTEXT X CONTEXT Y CONTEXT Z
+ --------- --------- ----------
+ mutex_lock A
+ lock_page B
+ lock_page B
+ mutex_lock A /* DEADLOCK */
+ unlock_page B held by X
+ unlock_page B
+ mutex_unlock A
+ mutex_unlock A
+
+ where A and B are different lock classes.
+
+No, we cannot.
+
+Example 2:
+
+ CONTEXT X CONTEXT Y
+ --------- ---------
+ mutex_lock A
+ mutex_lock A
+ wait_for_complete B /* DEADLOCK */
+ complete B
+ mutex_unlock A
+ mutex_unlock A
+
+ where A is a lock class and B is a completion variable.
+
+No, we cannot.
+
+CONCLUSION
+
+Given the limitation, lockdep cannot detect a deadlock or its
+possibility caused by page locks or completions.
+
+
+Relax the limitation
+--------------------
+
+Under the limitation, things to create dependencies are limited to
+typical locks. However, synchronization primitives like page locks and
+completions, which are allowed to be released in any context, also
+create dependencies and can cause a deadlock. So lockdep should track
+these locks to do a better job. We have to relax the limitation for
+these locks to work with lockdep.
+
+Detecting dependencies is very important for lockdep to work because
+adding a dependency means adding an opportunity to check whether it
+causes a deadlock. The more lockdep adds dependencies, the more it
+thoroughly works. Thus Lockdep has to do its best to detect and add as
+many true dependencies into a graph as possible.
+
+For example, considering only typical locks, lockdep builds a graph like:
+
+ A -> B -
+ \
+ -> E
+ /
+ C -> D -
+
+ where A, B,..., E are different lock classes.
+
+On the other hand, under the relaxation, additional dependencies might
+be created and added. Assuming additional 'FX -> C' and 'E -> GX' are
+added thanks to the relaxation, the graph will be:
+
+ A -> B -
+ \
+ -> E -> GX
+ /
+ FX -> C -> D -
+
+ where A, B,..., E, FX and GX are different lock classes, and a suffix
+ 'X' is added on non-typical locks.
+
+The latter graph gives us more chances to check circular dependencies
+than the former. However, it might suffer performance degradation since
+relaxing the limitation, with which design and implementation of lockdep
+can be efficient, might introduce inefficiency inevitably. So lockdep
+should provide two options, strong detection and efficient detection.
+
+Choosing efficient detection:
+
+ Lockdep works with only locks restricted to be released within the
+ acquire context. However, lockdep works efficiently.
+
+Choosing strong detection:
+
+ Lockdep works with all synchronization primitives. However, lockdep
+ suffers performance degradation.
+
+CONCLUSION
+
+Relaxing the limitation, lockdep can add additional dependencies giving
+additional opportunities to check circular dependencies.
+
+
+============
+Crossrelease
+============
+
+Introduce crossrelease
+----------------------
+
+In order to allow lockdep to handle additional dependencies by what
+might be released in any context, namely 'crosslock', we have to be able
+to identify those created by crosslocks. The proposed 'crossrelease'
+feature provoides a way to do that.
+
+Crossrelease feature has to do:
+
+ 1. Identify dependencies created by crosslocks.
+ 2. Add the dependencies into a dependency graph.
+
+That's all. Once a meaningful dependency is added into graph, then
+lockdep would work with the graph as it did. The most important thing
+crossrelease feature has to do is to correctly identify and add true
+dependencies into the global graph.
+
+A dependency e.g. 'A -> B' can be identified only in the A's release
+context because a decision required to identify the dependency can be
+made only in the release context. That is to decide whether A can be
+released so that a waiter for A can be woken up. It cannot be made in
+other than the A's release context.
+
+It's no matter for typical locks because each acquire context is same as
+its release context, thus lockdep can decide whether a lock can be
+released in the acquire context. However for crosslocks, lockdep cannot
+make the decision in the acquire context but has to wait until the
+release context is identified.
+
+Therefore, deadlocks by crosslocks cannot be detected just when it
+happens, because those cannot be identified until the crosslocks are
+released. However, deadlock possibilities can be detected and it's very
+worth. See 'APPENDIX A' section to check why.
+
+CONCLUSION
+
+Using crossrelease feature, lockdep can work with what might be released
+in any context, namely crosslock.
+
+
+Introduce commit
+----------------
+
+Since crossrelease defers the work adding true dependencies of
+crosslocks until they are actually released, crossrelease has to queue
+all acquisitions which might create dependencies with the crosslocks.
+Then it identifies dependencies using the queued data in batches at a
+proper time. We call it 'commit'.
+
+There are four types of dependencies:
+
+1. TT type: 'typical lock A -> typical lock B'
+
+ Just when acquiring B, lockdep can see it's in the A's release
+ context. So the dependency between A and B can be identified
+ immediately. Commit is unnecessary.
+
+2. TC type: 'typical lock A -> crosslock BX'
+
+ Just when acquiring BX, lockdep can see it's in the A's release
+ context. So the dependency between A and BX can be identified
+ immediately. Commit is unnecessary, too.
+
+3. CT type: 'crosslock AX -> typical lock B'
+
+ When acquiring B, lockdep cannot identify the dependency because
+ there's no way to know if it's in the AX's release context. It has
+ to wait until the decision can be made. Commit is necessary.
+
+4. CC type: 'crosslock AX -> crosslock BX'
+
+ When acquiring BX, lockdep cannot identify the dependency because
+ there's no way to know if it's in the AX's release context. It has
+ to wait until the decision can be made. Commit is necessary.
+ But, handling CC type is not implemented yet. It's a future work.
+
+Lockdep can work without commit for typical locks, but commit step is
+necessary once crosslocks are involved. Introducing commit, lockdep
+performs three steps. What lockdep does in each step is:
+
+1. Acquisition: For typical locks, lockdep does what it originally did
+ and queues the lock so that CT type dependencies can be checked using
+ it at the commit step. For crosslocks, it saves data which will be
+ used at the commit step and increases a reference count for it.
+
+2. Commit: No action is reauired for typical locks. For crosslocks,
+ lockdep adds CT type dependencies using the data saved at the
+ acquisition step.
+
+3. Release: No changes are required for typical locks. When a crosslock
+ is released, it decreases a reference count for it.
+
+CONCLUSION
+
+Crossrelease introduces commit step to handle dependencies of crosslocks
+in batches at a proper time.
+
+
+==============
+Implementation
+==============
+
+Data structures
+---------------
+
+Crossrelease introduces two main data structures.
+
+1. hist_lock
+
+ This is an array embedded in task_struct, for keeping lock history so
+ that dependencies can be added using them at the commit step. Since
+ it's local data, it can be accessed locklessly in the owner context.
+ The array is filled at the acquisition step and consumed at the
+ commit step. And it's managed in circular manner.
+
+2. cross_lock
+
+ One per lockdep_map exists. This is for keeping data of crosslocks
+ and used at the commit step.
+
+
+How crossrelease works
+----------------------
+
+It's the key of how crossrelease works, to defer necessary works to an
+appropriate point in time and perform in at once at the commit step.
+Let's take a look with examples step by step, starting from how lockdep
+works without crossrelease for typical locks.
+
+ acquire A /* Push A onto held_locks */
+ acquire B /* Push B onto held_locks and add 'A -> B' */
+ acquire C /* Push C onto held_locks and add 'B -> C' */
+ release C /* Pop C from held_locks */
+ release B /* Pop B from held_locks */
+ release A /* Pop A from held_locks */
+
+ where A, B and C are different lock classes.
+
+ NOTE: This document assumes that readers already understand how
+ lockdep works without crossrelease thus omits details. But there's
+ one thing to note. Lockdep pretends to pop a lock from held_locks
+ when releasing it. But it's subtly different from the original pop
+ operation because lockdep allows other than the top to be poped.
+
+In this case, lockdep adds 'the top of held_locks -> the lock to acquire'
+dependency every time acquiring a lock.
+
+After adding 'A -> B', a dependency graph will be:
+
+ A -> B
+
+ where A and B are different lock classes.
+
+And after adding 'B -> C', the graph will be:
+
+ A -> B -> C
+
+ where A, B and C are different lock classes.
+
+Let's performs commit step even for typical locks to add dependencies.
+Of course, commit step is not necessary for them, however, it would work
+well because this is a more general way.
+
+ acquire A
+ /*
+ * Queue A into hist_locks
+ *
+ * In hist_locks: A
+ * In graph: Empty
+ */
+
+ acquire B
+ /*
+ * Queue B into hist_locks
+ *
+ * In hist_locks: A, B
+ * In graph: Empty
+ */
+
+ acquire C
+ /*
+ * Queue C into hist_locks
+ *
+ * In hist_locks: A, B, C
+ * In graph: Empty
+ */
+
+ commit C
+ /*
+ * Add 'C -> ?'
+ * Answer the following to decide '?'
+ * What has been queued since acquire C: Nothing
+ *
+ * In hist_locks: A, B, C
+ * In graph: Empty
+ */
+
+ release C
+
+ commit B
+ /*
+ * Add 'B -> ?'
+ * Answer the following to decide '?'
+ * What has been queued since acquire B: C
+ *
+ * In hist_locks: A, B, C
+ * In graph: 'B -> C'
+ */
+
+ release B
+
+ commit A
+ /*
+ * Add 'A -> ?'
+ * Answer the following to decide '?'
+ * What has been queued since acquire A: B, C
+ *
+ * In hist_locks: A, B, C
+ * In graph: 'B -> C', 'A -> B', 'A -> C'
+ */
+
+ release A
+
+ where A, B and C are different lock classes.
+
+In this case, dependencies are added at the commit step as described.
+
+After commits for A, B and C, the graph will be:
+
+ A -> B -> C
+
+ where A, B and C are different lock classes.
+
+ NOTE: A dependency 'A -> C' is optimized out.
+
+We can see the former graph built without commit step is same as the
+latter graph built using commit steps. Of course the former way leads to
+earlier finish for building the graph, which means we can detect a
+deadlock or its possibility sooner. So the former way would be prefered
+when possible. But we cannot avoid using the latter way for crosslocks.
+
+Let's look at how commit steps work for crosslocks. In this case, the
+commit step is performed only on crosslock AX as real. And it assumes
+that the AX release context is different from the AX acquire context.
+
+ BX RELEASE CONTEXT BX ACQUIRE CONTEXT
+ ------------------ ------------------
+ acquire A
+ /*
+ * Push A onto held_locks
+ * Queue A into hist_locks
+ *
+ * In held_locks: A
+ * In hist_locks: A
+ * In graph: Empty
+ */
+
+ acquire BX
+ /*
+ * Add 'the top of held_locks -> BX'
+ *
+ * In held_locks: A
+ * In hist_locks: A
+ * In graph: 'A -> BX'
+ */
+
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ It must be guaranteed that the following operations are seen after
+ acquiring BX globally. It can be done by things like barrier.
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ acquire C
+ /*
+ * Push C onto held_locks
+ * Queue C into hist_locks
+ *
+ * In held_locks: C
+ * In hist_locks: C
+ * In graph: 'A -> BX'
+ */
+
+ release C
+ /*
+ * Pop C from held_locks
+ *
+ * In held_locks: Empty
+ * In hist_locks: C
+ * In graph: 'A -> BX'
+ */
+ acquire D
+ /*
+ * Push D onto held_locks
+ * Queue D into hist_locks
+ * Add 'the top of held_locks -> D'
+ *
+ * In held_locks: A, D
+ * In hist_locks: A, D
+ * In graph: 'A -> BX', 'A -> D'
+ */
+ acquire E
+ /*
+ * Push E onto held_locks
+ * Queue E into hist_locks
+ *
+ * In held_locks: E
+ * In hist_locks: C, E
+ * In graph: 'A -> BX', 'A -> D'
+ */
+
+ release E
+ /*
+ * Pop E from held_locks
+ *
+ * In held_locks: Empty
+ * In hist_locks: D, E
+ * In graph: 'A -> BX', 'A -> D'
+ */
+ release D
+ /*
+ * Pop D from held_locks
+ *
+ * In held_locks: A
+ * In hist_locks: A, D
+ * In graph: 'A -> BX', 'A -> D'
+ */
+ commit BX
+ /*
+ * Add 'BX -> ?'
+ * What has been queued since acquire BX: C, E
+ *
+ * In held_locks: Empty
+ * In hist_locks: D, E
+ * In graph: 'A -> BX', 'A -> D',
+ * 'BX -> C', 'BX -> E'
+ */
+
+ release BX
+ /*
+ * In held_locks: Empty
+ * In hist_locks: D, E
+ * In graph: 'A -> BX', 'A -> D',
+ * 'BX -> C', 'BX -> E'
+ */
+ release A
+ /*
+ * Pop A from held_locks
+ *
+ * In held_locks: Empty
+ * In hist_locks: A, D
+ * In graph: 'A -> BX', 'A -> D',
+ * 'BX -> C', 'BX -> E'
+ */
+
+ where A, BX, C,..., E are different lock classes, and a suffix 'X' is
+ added on crosslocks.
+
+Crossrelease considers all acquisitions after acqiuring BX are
+candidates which might create dependencies with BX. True dependencies
+will be determined when identifying the release context of BX. Meanwhile,
+all typical locks are queued so that they can be used at the commit step.
+And then two dependencies 'BX -> C' and 'BX -> E' are added at the
+commit step when identifying the release context.
+
+The final graph will be, with crossrelease:
+
+ -> C
+ /
+ -> BX -
+ / \
+ A - -> E
+ \
+ -> D
+
+ where A, BX, C,..., E are different lock classes, and a suffix 'X' is
+ added on crosslocks.
+
+However, the final graph will be, without crossrelease:
+
+ A -> D
+
+ where A and D are different lock classes.
+
+The former graph has three more dependencies, 'A -> BX', 'BX -> C' and
+'BX -> E' giving additional opportunities to check if they cause
+deadlocks. This way lockdep can detect a deadlock or its possibility
+caused by crosslocks.
+
+CONCLUSION
+
+We checked how crossrelease works with several examples.
+
+
+=============
+Optimizations
+=============
+
+Avoid duplication
+-----------------
+
+Crossrelease feature uses a cache like what lockdep already uses for
+dependency chains, but this time it's for caching CT type dependencies.
+Once that dependency is cached, the same will never be added again.
+
+
+Lockless for hot paths
+----------------------
+
+To keep all locks for later use at the commit step, crossrelease adopts
+a local array embedded in task_struct, which makes access to the data
+lockless by forcing it to happen only within the owner context. It's
+like how lockdep handles held_locks. Lockless implmentation is important
+since typical locks are very frequently acquired and released.
+
+
+=================================================
+APPENDIX A: What lockdep does to work aggresively
+=================================================
+
+A deadlock actually occurs when all wait operations creating circular
+dependencies run at the same time. Even though they don't, a potential
+deadlock exists if the problematic dependencies exist. Thus it's
+meaningful to detect not only an actual deadlock but also its potential
+possibility. The latter is rather valuable. When a deadlock occurs
+actually, we can identify what happens in the system by some means or
+other even without lockdep. However, there's no way to detect possiblity
+without lockdep unless the whole code is parsed in head. It's terrible.
+Lockdep does the both, and crossrelease only focuses on the latter.
+
+Whether or not a deadlock actually occurs depends on several factors.
+For example, what order contexts are switched in is a factor. Assuming
+circular dependencies exist, a deadlock would occur when contexts are
+switched so that all wait operations creating the dependencies run
+simultaneously. Thus to detect a deadlock possibility even in the case
+that it has not occured yet, lockdep should consider all possible
+combinations of dependencies, trying to:
+
+1. Use a global dependency graph.
+
+ Lockdep combines all dependencies into one global graph and uses them,
+ regardless of which context generates them or what order contexts are
+ switched in. Aggregated dependencies are only considered so they are
+ prone to be circular if a problem exists.
+
+2. Check dependencies between classes instead of instances.
+
+ What actually causes a deadlock are instances of lock. However,
+ lockdep checks dependencies between classes instead of instances.
+ This way lockdep can detect a deadlock which has not happened but
+ might happen in future by others but the same class.
+
+3. Assume all acquisitions lead to waiting.
+
+ Although locks might be acquired without waiting which is essential
+ to create dependencies, lockdep assumes all acquisitions lead to
+ waiting since it might be true some time or another.
+
+CONCLUSION
+
+Lockdep detects not only an actual deadlock but also its possibility,
+and the latter is more valuable.
+
+
+==================================================
+APPENDIX B: How to avoid adding false dependencies
+==================================================
+
+Remind what a dependency is. A dependency exists if:
+
+ 1. There are two waiters waiting for each event at a given time.
+ 2. The only way to wake up each waiter is to trigger its event.
+ 3. Whether one can be woken up depends on whether the other can.
+
+For example:
+
+ acquire A
+ acquire B /* A dependency 'A -> B' exists */
+ release B
+ release A
+
+ where A and B are different lock classes.
+
+A depedency 'A -> B' exists since:
+
+ 1. A waiter for A and a waiter for B might exist when acquiring B.
+ 2. Only way to wake up each is to release what it waits for.
+ 3. Whether the waiter for A can be woken up depends on whether the
+ other can. IOW, TASK X cannot release A if it fails to acquire B.
+
+For another example:
+
+ TASK X TASK Y
+ ------ ------
+ acquire AX
+ acquire B /* A dependency 'AX -> B' exists */
+ release B
+ release AX held by Y
+
+ where AX and B are different lock classes, and a suffix 'X' is added
+ on crosslocks.
+
+Even in this case involving crosslocks, the same rule can be applied. A
+depedency 'AX -> B' exists since:
+
+ 1. A waiter for AX and a waiter for B might exist when acquiring B.
+ 2. Only way to wake up each is to release what it waits for.
+ 3. Whether the waiter for AX can be woken up depends on whether the
+ other can. IOW, TASK X cannot release AX if it fails to acquire B.
+
+Let's take a look at more complicated example:
+
+ TASK X TASK Y
+ ------ ------
+ acquire B
+ release B
+ fork Y
+ acquire AX
+ acquire C /* A dependency 'AX -> C' exists */
+ release C
+ release AX held by Y
+
+ where AX, B and C are different lock classes, and a suffix 'X' is
+ added on crosslocks.
+
+Does a dependency 'AX -> B' exist? Nope.
+
+Two waiters are essential to create a dependency. However, waiters for
+AX and B to create 'AX -> B' cannot exist at the same time in this
+example. Thus the dependency 'AX -> B' cannot be created.
+
+It would be ideal if the full set of true ones can be considered. But
+we can ensure nothing but what actually happened. Relying on what
+actually happens at runtime, we can anyway add only true ones, though
+they might be a subset of true ones. It's similar to how lockdep works
+for typical locks. There might be more true dependencies than what
+lockdep has detected in runtime. Lockdep has no choice but to rely on
+what actually happens. Crossrelease also relies on it.
+
+CONCLUSION
+
+Relying on what actually happens, lockdep can avoid adding false
+dependencies.
diff --git a/Documentation/locking/rt-mutex-design.txt b/Documentation/locking/rt-mutex-design.txt
index 8666070d3189..6c6e8c2410de 100644
--- a/Documentation/locking/rt-mutex-design.txt
+++ b/Documentation/locking/rt-mutex-design.txt
@@ -97,9 +97,9 @@ waiter - A waiter is a struct that is stored on the stack of a blocked
a process being blocked on the mutex, it is fine to allocate
the waiter on the process's stack (local variable). This
structure holds a pointer to the task, as well as the mutex that
- the task is blocked on. It also has the plist node structures to
- place the task in the waiter_list of a mutex as well as the
- pi_list of a mutex owner task (described below).
+ the task is blocked on. It also has rbtree node structures to
+ place the task in the waiters rbtree of a mutex as well as the
+ pi_waiters rbtree of a mutex owner task (described below).
waiter is sometimes used in reference to the task that is waiting
on a mutex. This is the same as waiter->task.
@@ -179,53 +179,34 @@ again.
|
F->L5-+
+If process G has the highest priority in the chain, then all the tasks up
+the chain (A and B in this example), must have their priorities increased
+to that of G.
-Plist
------
-
-Before I go further and talk about how the PI chain is stored through lists
-on both mutexes and processes, I'll explain the plist. This is similar to
-the struct list_head functionality that is already in the kernel.
-The implementation of plist is out of scope for this document, but it is
-very important to understand what it does.
-
-There are a few differences between plist and list, the most important one
-being that plist is a priority sorted linked list. This means that the
-priorities of the plist are sorted, such that it takes O(1) to retrieve the
-highest priority item in the list. Obviously this is useful to store processes
-based on their priorities.
-
-Another difference, which is important for implementation, is that, unlike
-list, the head of the list is a different element than the nodes of a list.
-So the head of the list is declared as struct plist_head and nodes that will
-be added to the list are declared as struct plist_node.
-
-
-Mutex Waiter List
+Mutex Waiters Tree
-----------------
-Every mutex keeps track of all the waiters that are blocked on itself. The mutex
-has a plist to store these waiters by priority. This list is protected by
-a spin lock that is located in the struct of the mutex. This lock is called
-wait_lock. Since the modification of the waiter list is never done in
-interrupt context, the wait_lock can be taken without disabling interrupts.
+Every mutex keeps track of all the waiters that are blocked on itself. The
+mutex has a rbtree to store these waiters by priority. This tree is protected
+by a spin lock that is located in the struct of the mutex. This lock is called
+wait_lock.
-Task PI List
+Task PI Tree
------------
-To keep track of the PI chains, each process has its own PI list. This is
-a list of all top waiters of the mutexes that are owned by the process.
-Note that this list only holds the top waiters and not all waiters that are
+To keep track of the PI chains, each process has its own PI rbtree. This is
+a tree of all top waiters of the mutexes that are owned by the process.
+Note that this tree only holds the top waiters and not all waiters that are
blocked on mutexes owned by the process.
-The top of the task's PI list is always the highest priority task that
+The top of the task's PI tree is always the highest priority task that
is waiting on a mutex that is owned by the task. So if the task has
inherited a priority, it will always be the priority of the task that is
-at the top of this list.
+at the top of this tree.
-This list is stored in the task structure of a process as a plist called
-pi_list. This list is protected by a spin lock also in the task structure,
+This tree is stored in the task structure of a process as a rbtree called
+pi_waiters. It is protected by a spin lock also in the task structure,
called pi_lock. This lock may also be taken in interrupt context, so when
locking the pi_lock, interrupts must be disabled.
@@ -312,15 +293,12 @@ Mutex owner and flags
The mutex structure contains a pointer to the owner of the mutex. If the
mutex is not owned, this owner is set to NULL. Since all architectures
-have the task structure on at least a four byte alignment (and if this is
-not true, the rtmutex.c code will be broken!), this allows for the two
-least significant bits to be used as flags. This part is also described
-in Documentation/rt-mutex.txt, but will also be briefly described here.
-
-Bit 0 is used as the "Pending Owner" flag. This is described later.
-Bit 1 is used as the "Has Waiters" flags. This is also described later
- in more detail, but is set whenever there are waiters on a mutex.
+have the task structure on at least a two byte alignment (and if this is
+not true, the rtmutex.c code will be broken!), this allows for the least
+significant bit to be used as a flag. Bit 0 is used as the "Has Waiters"
+flag. It's set whenever there are waiters on a mutex.
+See Documentation/locking/rt-mutex.txt for further details.
cmpxchg Tricks
--------------
@@ -359,40 +337,31 @@ Priority adjustments
--------------------
The implementation of the PI code in rtmutex.c has several places that a
-process must adjust its priority. With the help of the pi_list of a
+process must adjust its priority. With the help of the pi_waiters of a
process this is rather easy to know what needs to be adjusted.
-The functions implementing the task adjustments are rt_mutex_adjust_prio,
-__rt_mutex_adjust_prio (same as the former, but expects the task pi_lock
-to already be taken), rt_mutex_getprio, and rt_mutex_setprio.
+The functions implementing the task adjustments are rt_mutex_adjust_prio
+and rt_mutex_setprio. rt_mutex_setprio is only used in rt_mutex_adjust_prio.
-rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio.
+rt_mutex_adjust_prio examines the priority of the task, and the highest
+priority process that is waiting any of mutexes owned by the task. Since
+the pi_waiters of a task holds an order by priority of all the top waiters
+of all the mutexes that the task owns, we simply need to compare the top
+pi waiter to its own normal/deadline priority and take the higher one.
+Then rt_mutex_setprio is called to adjust the priority of the task to the
+new priority. Note that rt_mutex_setprio is defined in kernel/sched/core.c
+to implement the actual change in priority.
-rt_mutex_getprio returns the priority that the task should have. Either the
-task's own normal priority, or if a process of a higher priority is waiting on
-a mutex owned by the task, then that higher priority should be returned.
-Since the pi_list of a task holds an order by priority list of all the top
-waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs
-to compare the top pi waiter to its own normal priority, and return the higher
-priority back.
+(Note: For the "prio" field in task_struct, the lower the number, the
+ higher the priority. A "prio" of 5 is of higher priority than a
+ "prio" of 10.)
-(Note: if looking at the code, you will notice that the lower number of
- prio is returned. This is because the prio field in the task structure
- is an inverse order of the actual priority. So a "prio" of 5 is
- of higher priority than a "prio" of 10.)
-
-__rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the
-result does not equal the task's current priority, then rt_mutex_setprio
-is called to adjust the priority of the task to the new priority.
-Note that rt_mutex_setprio is defined in kernel/sched/core.c to implement the
-actual change in priority.
-
-It is interesting to note that __rt_mutex_adjust_prio can either increase
+It is interesting to note that rt_mutex_adjust_prio can either increase
or decrease the priority of the task. In the case that a higher priority
-process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio
+process has just blocked on a mutex owned by the task, rt_mutex_adjust_prio
would increase/boost the task's priority. But if a higher priority task
were for some reason to leave the mutex (timeout or signal), this same function
-would decrease/unboost the priority of the task. That is because the pi_list
+would decrease/unboost the priority of the task. That is because the pi_waiters
always contains the highest priority task that is waiting on a mutex owned
by the task, so we only need to compare the priority of that top pi waiter
to the normal priority of the given task.
@@ -412,9 +381,10 @@ priorities.
rt_mutex_adjust_prio_chain is called with a task to be checked for PI
(de)boosting (the owner of a mutex that a process is blocking on), a flag to
-check for deadlocking, the mutex that the task owns, and a pointer to a waiter
+check for deadlocking, the mutex that the task owns, a pointer to a waiter
that is the process's waiter struct that is blocked on the mutex (although this
-parameter may be NULL for deboosting).
+parameter may be NULL for deboosting), a pointer to the mutex on which the task
+is blocked, and a top_task as the top waiter of the mutex.
For this explanation, I will not mention deadlock detection. This explanation
will try to stay at a high level.
@@ -424,133 +394,14 @@ that the state of the owner and lock can change when entered into this function.
Before this function is called, the task has already had rt_mutex_adjust_prio
performed on it. This means that the task is set to the priority that it
-should be at, but the plist nodes of the task's waiter have not been updated
-with the new priorities, and that this task may not be in the proper locations
-in the pi_lists and wait_lists that the task is blocked on. This function
+should be at, but the rbtree nodes of the task's waiter have not been updated
+with the new priorities, and this task may not be in the proper locations
+in the pi_waiters and waiters trees that the task is blocked on. This function
solves all that.
-A loop is entered, where task is the owner to be checked for PI changes that
-was passed by parameter (for the first iteration). The pi_lock of this task is
-taken to prevent any more changes to the pi_list of the task. This also
-prevents new tasks from completing the blocking on a mutex that is owned by this
-task.
-
-If the task is not blocked on a mutex then the loop is exited. We are at
-the top of the PI chain.
-
-A check is now done to see if the original waiter (the process that is blocked
-on the current mutex) is the top pi waiter of the task. That is, is this
-waiter on the top of the task's pi_list. If it is not, it either means that
-there is another process higher in priority that is blocked on one of the
-mutexes that the task owns, or that the waiter has just woken up via a signal
-or timeout and has left the PI chain. In either case, the loop is exited, since
-we don't need to do any more changes to the priority of the current task, or any
-task that owns a mutex that this current task is waiting on. A priority chain
-walk is only needed when a new top pi waiter is made to a task.
-
-The next check sees if the task's waiter plist node has the priority equal to
-the priority the task is set at. If they are equal, then we are done with
-the loop. Remember that the function started with the priority of the
-task adjusted, but the plist nodes that hold the task in other processes
-pi_lists have not been adjusted.
-
-Next, we look at the mutex that the task is blocked on. The mutex's wait_lock
-is taken. This is done by a spin_trylock, because the locking order of the
-pi_lock and wait_lock goes in the opposite direction. If we fail to grab the
-lock, the pi_lock is released, and we restart the loop.
-
-Now that we have both the pi_lock of the task as well as the wait_lock of
-the mutex the task is blocked on, we update the task's waiter's plist node
-that is located on the mutex's wait_list.
-
-Now we release the pi_lock of the task.
-
-Next the owner of the mutex has its pi_lock taken, so we can update the
-task's entry in the owner's pi_list. If the task is the highest priority
-process on the mutex's wait_list, then we remove the previous top waiter
-from the owner's pi_list, and replace it with the task.
-
-Note: It is possible that the task was the current top waiter on the mutex,
- in which case the task is not yet on the pi_list of the waiter. This
- is OK, since plist_del does nothing if the plist node is not on any
- list.
-
-If the task was not the top waiter of the mutex, but it was before we
-did the priority updates, that means we are deboosting/lowering the
-task. In this case, the task is removed from the pi_list of the owner,
-and the new top waiter is added.
-
-Lastly, we unlock both the pi_lock of the task, as well as the mutex's
-wait_lock, and continue the loop again. On the next iteration of the
-loop, the previous owner of the mutex will be the task that will be
-processed.
-
-Note: One might think that the owner of this mutex might have changed
- since we just grab the mutex's wait_lock. And one could be right.
- The important thing to remember is that the owner could not have
- become the task that is being processed in the PI chain, since
- we have taken that task's pi_lock at the beginning of the loop.
- So as long as there is an owner of this mutex that is not the same
- process as the tasked being worked on, we are OK.
-
- Looking closely at the code, one might be confused. The check for the
- end of the PI chain is when the task isn't blocked on anything or the
- task's waiter structure "task" element is NULL. This check is
- protected only by the task's pi_lock. But the code to unlock the mutex
- sets the task's waiter structure "task" element to NULL with only
- the protection of the mutex's wait_lock, which was not taken yet.
- Isn't this a race condition if the task becomes the new owner?
-
- The answer is No! The trick is the spin_trylock of the mutex's
- wait_lock. If we fail that lock, we release the pi_lock of the
- task and continue the loop, doing the end of PI chain check again.
-
- In the code to release the lock, the wait_lock of the mutex is held
- the entire time, and it is not let go when we grab the pi_lock of the
- new owner of the mutex. So if the switch of a new owner were to happen
- after the check for end of the PI chain and the grabbing of the
- wait_lock, the unlocking code would spin on the new owner's pi_lock
- but never give up the wait_lock. So the PI chain loop is guaranteed to
- fail the spin_trylock on the wait_lock, release the pi_lock, and
- try again.
-
- If you don't quite understand the above, that's OK. You don't have to,
- unless you really want to make a proof out of it ;)
-
-
-Pending Owners and Lock stealing
---------------------------------
-
-One of the flags in the owner field of the mutex structure is "Pending Owner".
-What this means is that an owner was chosen by the process releasing the
-mutex, but that owner has yet to wake up and actually take the mutex.
-
-Why is this important? Why can't we just give the mutex to another process
-and be done with it?
-
-The PI code is to help with real-time processes, and to let the highest
-priority process run as long as possible with little latencies and delays.
-If a high priority process owns a mutex that a lower priority process is
-blocked on, when the mutex is released it would be given to the lower priority
-process. What if the higher priority process wants to take that mutex again.
-The high priority process would fail to take that mutex that it just gave up
-and it would need to boost the lower priority process to run with full
-latency of that critical section (since the low priority process just entered
-it).
-
-There's no reason a high priority process that gives up a mutex should be
-penalized if it tries to take that mutex again. If the new owner of the
-mutex has not woken up yet, there's no reason that the higher priority process
-could not take that mutex away.
-
-To solve this, we introduced Pending Ownership and Lock Stealing. When a
-new process is given a mutex that it was blocked on, it is only given
-pending ownership. This means that it's the new owner, unless a higher
-priority process comes in and tries to grab that mutex. If a higher priority
-process does come along and wants that mutex, we let the higher priority
-process "steal" the mutex from the pending owner (only if it is still pending)
-and continue with the mutex.
-
+The main operation of this function is summarized by Thomas Gleixner in
+rtmutex.c. See the 'Chain walk basics and protection scope' comment for further
+details.
Taking of a mutex (The walk through)
------------------------------------
@@ -563,14 +414,14 @@ done when we have CMPXCHG enabled (otherwise the fast taking automatically
fails). Only when the owner field of the mutex is NULL can the lock be
taken with the CMPXCHG and nothing else needs to be done.
-If there is contention on the lock, whether it is owned or pending owner
-we go about the slow path (rt_mutex_slowlock).
+If there is contention on the lock, we go about the slow path
+(rt_mutex_slowlock).
The slow path function is where the task's waiter structure is created on
the stack. This is because the waiter structure is only needed for the
scope of this function. The waiter structure holds the nodes to store
-the task on the wait_list of the mutex, and if need be, the pi_list of
-the owner.
+the task on the waiters tree of the mutex, and if need be, the pi_waiters
+tree of the owner.
The wait_lock of the mutex is taken since the slow path of unlocking the
mutex also takes this lock.
@@ -581,102 +432,45 @@ contention).
try_to_take_rt_mutex is used every time the task tries to grab a mutex in the
slow path. The first thing that is done here is an atomic setting of
-the "Has Waiters" flag of the mutex's owner field. Yes, this could really
-be false, because if the mutex has no owner, there are no waiters and
-the current task also won't have any waiters. But we don't have the lock
-yet, so we assume we are going to be a waiter. The reason for this is to
-play nice for those architectures that do have CMPXCHG. By setting this flag
-now, the owner of the mutex can't release the mutex without going into the
-slow unlock path, and it would then need to grab the wait_lock, which this
-code currently holds. So setting the "Has Waiters" flag forces the owner
-to synchronize with this code.
-
-Now that we know that we can't have any races with the owner releasing the
-mutex, we check to see if we can take the ownership. This is done if the
-mutex doesn't have a owner, or if we can steal the mutex from a pending
-owner. Let's look at the situations we have here.
-
- 1) Has owner that is pending
- ----------------------------
-
- The mutex has a owner, but it hasn't woken up and the mutex flag
- "Pending Owner" is set. The first check is to see if the owner isn't the
- current task. This is because this function is also used for the pending
- owner to grab the mutex. When a pending owner wakes up, it checks to see
- if it can take the mutex, and this is done if the owner is already set to
- itself. If so, we succeed and leave the function, clearing the "Pending
- Owner" bit.
-
- If the pending owner is not current, we check to see if the current priority is
- higher than the pending owner. If not, we fail the function and return.
-
- There's also something special about a pending owner. That is a pending owner
- is never blocked on a mutex. So there is no PI chain to worry about. It also
- means that if the mutex doesn't have any waiters, there's no accounting needed
- to update the pending owner's pi_list, since we only worry about processes
- blocked on the current mutex.
-
- If there are waiters on this mutex, and we just stole the ownership, we need
- to take the top waiter, remove it from the pi_list of the pending owner, and
- add it to the current pi_list. Note that at this moment, the pending owner
- is no longer on the list of waiters. This is fine, since the pending owner
- would add itself back when it realizes that it had the ownership stolen
- from itself. When the pending owner tries to grab the mutex, it will fail
- in try_to_take_rt_mutex if the owner field points to another process.
-
- 2) No owner
- -----------
-
- If there is no owner (or we successfully stole the lock), we set the owner
- of the mutex to current, and set the flag of "Has Waiters" if the current
- mutex actually has waiters, or we clear the flag if it doesn't. See, it was
- OK that we set that flag early, since now it is cleared.
-
- 3) Failed to grab ownership
- ---------------------------
-
- The most interesting case is when we fail to take ownership. This means that
- there exists an owner, or there's a pending owner with equal or higher
- priority than the current task.
-
-We'll continue on the failed case.
-
-If the mutex has a timeout, we set up a timer to go off to break us out
-of this mutex if we failed to get it after a specified amount of time.
-
-Now we enter a loop that will continue to try to take ownership of the mutex, or
-fail from a timeout or signal.
-
-Once again we try to take the mutex. This will usually fail the first time
-in the loop, since it had just failed to get the mutex. But the second time
-in the loop, this would likely succeed, since the task would likely be
-the pending owner.
-
-If the mutex is TASK_INTERRUPTIBLE a check for signals and timeout is done
-here.
-
-The waiter structure has a "task" field that points to the task that is blocked
-on the mutex. This field can be NULL the first time it goes through the loop
-or if the task is a pending owner and had its mutex stolen. If the "task"
-field is NULL then we need to set up the accounting for it.
+the "Has Waiters" flag of the mutex's owner field. By setting this flag
+now, the current owner of the mutex being contended for can't release the mutex
+without going into the slow unlock path, and it would then need to grab the
+wait_lock, which this code currently holds. So setting the "Has Waiters" flag
+forces the current owner to synchronize with this code.
+
+The lock is taken if the following are true:
+ 1) The lock has no owner
+ 2) The current task is the highest priority against all other
+ waiters of the lock
+
+If the task succeeds to acquire the lock, then the task is set as the
+owner of the lock, and if the lock still has waiters, the top_waiter
+(highest priority task waiting on the lock) is added to this task's
+pi_waiters tree.
+
+If the lock is not taken by try_to_take_rt_mutex(), then the
+task_blocks_on_rt_mutex() function is called. This will add the task to
+the lock's waiter tree and propagate the pi chain of the lock as well
+as the lock's owner's pi_waiters tree. This is described in the next
+section.
Task blocks on mutex
--------------------
The accounting of a mutex and process is done with the waiter structure of
the process. The "task" field is set to the process, and the "lock" field
-to the mutex. The plist nodes are initialized to the processes current
-priority.
+to the mutex. The rbtree node of waiter are initialized to the processes
+current priority.
Since the wait_lock was taken at the entry of the slow lock, we can safely
-add the waiter to the wait_list. If the current process is the highest
-priority process currently waiting on this mutex, then we remove the
-previous top waiter process (if it exists) from the pi_list of the owner,
-and add the current process to that list. Since the pi_list of the owner
+add the waiter to the task waiter tree. If the current process is the
+highest priority process currently waiting on this mutex, then we remove the
+previous top waiter process (if it exists) from the pi_waiters of the owner,
+and add the current process to that tree. Since the pi_waiter of the owner
has changed, we call rt_mutex_adjust_prio on the owner to see if the owner
should adjust its priority accordingly.
-If the owner is also blocked on a lock, and had its pi_list changed
+If the owner is also blocked on a lock, and had its pi_waiters changed
(or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead
and run rt_mutex_adjust_prio_chain on the owner, as described earlier.
@@ -686,30 +480,23 @@ mutex (waiter "task" field is not NULL), then we go to sleep (call schedule).
Waking up in the loop
---------------------
-The schedule can then wake up for a few reasons.
- 1) we were given pending ownership of the mutex.
- 2) we received a signal and was TASK_INTERRUPTIBLE
- 3) we had a timeout and was TASK_INTERRUPTIBLE
+The task can then wake up for a couple of reasons:
+ 1) The previous lock owner released the lock, and the task now is top_waiter
+ 2) we received a signal or timeout
-In any of these cases, we continue the loop and once again try to grab the
-ownership of the mutex. If we succeed, we exit the loop, otherwise we continue
-and on signal and timeout, will exit the loop, or if we had the mutex stolen
-we just simply add ourselves back on the lists and go back to sleep.
+In both cases, the task will try again to acquire the lock. If it
+does, then it will take itself off the waiters tree and set itself back
+to the TASK_RUNNING state.
-Note: For various reasons, because of timeout and signals, the steal mutex
- algorithm needs to be careful. This is because the current process is
- still on the wait_list. And because of dynamic changing of priorities,
- especially on SCHED_OTHER tasks, the current process can be the
- highest priority task on the wait_list.
-
-Failed to get mutex on Timeout or Signal
-----------------------------------------
+In first case, if the lock was acquired by another task before this task
+could get the lock, then it will go back to sleep and wait to be woken again.
-If a timeout or signal occurred, the waiter's "task" field would not be
-NULL and the task needs to be taken off the wait_list of the mutex and perhaps
-pi_list of the owner. If this process was a high priority process, then
-the rt_mutex_adjust_prio_chain needs to be executed again on the owner,
-but this time it will be lowering the priorities.
+The second case is only applicable for tasks that are grabbing a mutex
+that can wake up before getting the lock, either due to a signal or
+a timeout (i.e. rt_mutex_timed_futex_lock()). When woken, it will try to
+take the lock again, if it succeeds, then the task will return with the
+lock held, otherwise it will return with -EINTR if the task was woken
+by a signal, or -ETIMEDOUT if it timed out.
Unlocking the Mutex
@@ -739,25 +526,12 @@ owner still needs to make this check. If there are no waiters then the mutex
owner field is set to NULL, the wait_lock is released and nothing more is
needed.
-If there are waiters, then we need to wake one up and give that waiter
-pending ownership.
+If there are waiters, then we need to wake one up.
On the wake up code, the pi_lock of the current owner is taken. The top
-waiter of the lock is found and removed from the wait_list of the mutex
-as well as the pi_list of the current owner. The task field of the new
-pending owner's waiter structure is set to NULL, and the owner field of the
-mutex is set to the new owner with the "Pending Owner" bit set, as well
-as the "Has Waiters" bit if there still are other processes blocked on the
-mutex.
-
-The pi_lock of the previous owner is released, and the new pending owner's
-pi_lock is taken. Remember that this is the trick to prevent the race
-condition in rt_mutex_adjust_prio_chain from adding itself as a waiter
-on the mutex.
-
-We now clear the "pi_blocked_on" field of the new pending owner, and if
-the mutex still has waiters pending, we add the new top waiter to the pi_list
-of the pending owner.
+waiter of the lock is found and removed from the waiters tree of the mutex
+as well as the pi_waiters tree of the current owner. The "Has Waiters" bit is
+marked to prevent lower priority tasks from stealing the lock.
Finally we unlock the pi_lock of the pending owner and wake it up.
@@ -772,10 +546,14 @@ Credits
-------
Author: Steven Rostedt <rostedt@goodmis.org>
+Updated: Alex Shi <alex.shi@linaro.org> - 7/6/2017
-Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap
+Original Reviewers: Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
+ Randy Dunlap
+Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior
Updates
-------
This document was originally written for 2.6.17-rc3-mm1
+was updated on 4.12
diff --git a/Documentation/locking/rt-mutex.txt b/Documentation/locking/rt-mutex.txt
index 243393d882ee..35793e003041 100644
--- a/Documentation/locking/rt-mutex.txt
+++ b/Documentation/locking/rt-mutex.txt
@@ -28,14 +28,13 @@ magic bullet for poorly designed applications, but it allows
well-designed applications to use userspace locks in critical parts of
an high priority thread, without losing determinism.
-The enqueueing of the waiters into the rtmutex waiter list is done in
+The enqueueing of the waiters into the rtmutex waiter tree is done in
priority order. For same priorities FIFO order is chosen. For each
rtmutex, only the top priority waiter is enqueued into the owner's
-priority waiters list. This list too queues in priority order. Whenever
+priority waiters tree. This tree too queues in priority order. Whenever
the top priority waiter of a task changes (for example it timed out or
-got a signal), the priority of the owner task is readjusted. [The
-priority enqueueing is handled by "plists", see include/linux/plist.h
-for more details.]
+got a signal), the priority of the owner task is readjusted. The
+priority enqueueing is handled by "pi_waiters".
RT-mutexes are optimized for fastpath operations and have no internal
locking overhead when locking an uncontended mutex or unlocking a mutex
@@ -46,34 +45,29 @@ is used]
The state of the rt-mutex is tracked via the owner field of the rt-mutex
structure:
-rt_mutex->owner holds the task_struct pointer of the owner. Bit 0 and 1
-are used to keep track of the "owner is pending" and "rtmutex has
-waiters" state.
+lock->owner holds the task_struct pointer of the owner. Bit 0 is used to
+keep track of the "lock has waiters" state.
- owner bit1 bit0
- NULL 0 0 mutex is free (fast acquire possible)
- NULL 0 1 invalid state
- NULL 1 0 Transitional state*
- NULL 1 1 invalid state
- taskpointer 0 0 mutex is held (fast release possible)
- taskpointer 0 1 task is pending owner
- taskpointer 1 0 mutex is held and has waiters
- taskpointer 1 1 task is pending owner and mutex has waiters
+ owner bit0
+ NULL 0 lock is free (fast acquire possible)
+ NULL 1 lock is free and has waiters and the top waiter
+ is going to take the lock*
+ taskpointer 0 lock is held (fast release possible)
+ taskpointer 1 lock is held and has waiters**
-Pending-ownership handling is a performance optimization:
-pending-ownership is assigned to the first (highest priority) waiter of
-the mutex, when the mutex is released. The thread is woken up and once
-it starts executing it can acquire the mutex. Until the mutex is taken
-by it (bit 0 is cleared) a competing higher priority thread can "steal"
-the mutex which puts the woken up thread back on the waiters list.
+The fast atomic compare exchange based acquire and release is only
+possible when bit 0 of lock->owner is 0.
-The pending-ownership optimization is especially important for the
-uninterrupted workflow of high-prio tasks which repeatedly
-takes/releases locks that have lower-prio waiters. Without this
-optimization the higher-prio thread would ping-pong to the lower-prio
-task [because at unlock time we always assign a new owner].
+(*) It also can be a transitional state when grabbing the lock
+with ->wait_lock is held. To prevent any fast path cmpxchg to the lock,
+we need to set the bit0 before looking at the lock, and the owner may be
+NULL in this small time, hence this can be a transitional state.
-(*) The "mutex has waiters" bit gets set to take the lock. If the lock
-doesn't already have an owner, this bit is quickly cleared if there are
-no waiters. So this is a transitional state to synchronize with looking
-at the owner field of the mutex and the mutex owner releasing the lock.
+(**) There is a small time when bit 0 is set but there are no
+waiters. This can happen when grabbing the lock in the slow path.
+To prevent a cmpxchg of the owner releasing the lock, we need to
+set this bit before looking at the lock.
+
+BTW, there is still technically a "Pending Owner", it's just not called
+that anymore. The pending owner happens to be the top_waiter of a lock
+that has no owner and has been woken up to grab the lock.
diff --git a/Documentation/media/uapi/cec/cec-funcs.rst b/Documentation/media/uapi/cec/cec-funcs.rst
index 5b7630f2e076..6d696cead5cb 100644
--- a/Documentation/media/uapi/cec/cec-funcs.rst
+++ b/Documentation/media/uapi/cec/cec-funcs.rst
@@ -7,7 +7,6 @@ Function Reference
.. toctree::
:maxdepth: 1
- :numbered:
cec-func-open
cec-func-close
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index c4ddfcd5ee32..b759a60624fd 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -498,11 +498,11 @@ And a couple of implicit varieties:
This means that ACQUIRE acts as a minimal "acquire" operation and
RELEASE acts as a minimal "release" operation.
-A subset of the atomic operations described in core-api/atomic_ops.rst have
-ACQUIRE and RELEASE variants in addition to fully-ordered and relaxed (no
-barrier semantics) definitions. For compound atomics performing both a load
-and a store, ACQUIRE semantics apply only to the load and RELEASE semantics
-apply only to the store portion of the operation.
+A subset of the atomic operations described in atomic_t.txt have ACQUIRE and
+RELEASE variants in addition to fully-ordered and relaxed (no barrier
+semantics) definitions. For compound atomics performing both a load and a
+store, ACQUIRE semantics apply only to the load and RELEASE semantics apply
+only to the store portion of the operation.
Memory barriers are only required where there's a possibility of interaction
between two CPUs or between a CPU and a device. If it can be guaranteed that
@@ -594,7 +594,24 @@ between the address load and the data load:
This enforces the occurrence of one of the two implications, and prevents the
third possibility from arising.
-A data-dependency barrier must also order against dependent writes:
+
+[!] Note that this extremely counterintuitive situation arises most easily on
+machines with split caches, so that, for example, one cache bank processes
+even-numbered cache lines and the other bank processes odd-numbered cache
+lines. The pointer P might be stored in an odd-numbered cache line, and the
+variable B might be stored in an even-numbered cache line. Then, if the
+even-numbered bank of the reading CPU's cache is extremely busy while the
+odd-numbered bank is idle, one can see the new value of the pointer P (&B),
+but the old value of the variable B (2).
+
+
+A data-dependency barrier is not required to order dependent writes
+because the CPUs that the Linux kernel supports don't do writes
+until they are certain (1) that the write will actually happen, (2)
+of the location of the write, and (3) of the value to be written.
+But please carefully read the "CONTROL DEPENDENCIES" section and the
+Documentation/RCU/rcu_dereference.txt file: The compiler can and does
+break dependencies in a great many highly creative ways.
CPU 1 CPU 2
=============== ===============
@@ -603,29 +620,19 @@ A data-dependency barrier must also order against dependent writes:
<write barrier>
WRITE_ONCE(P, &B);
Q = READ_ONCE(P);
- <data dependency barrier>
- *Q = 5;
+ WRITE_ONCE(*Q, 5);
-The data-dependency barrier must order the read into Q with the store
-into *Q. This prohibits this outcome:
+Therefore, no data-dependency barrier is required to order the read into
+Q with the store into *Q. In other words, this outcome is prohibited,
+even without a data-dependency barrier:
(Q == &B) && (B == 4)
Please note that this pattern should be rare. After all, the whole point
of dependency ordering is to -prevent- writes to the data structure, along
with the expensive cache misses associated with those writes. This pattern
-can be used to record rare error conditions and the like, and the ordering
-prevents such records from being lost.
-
-
-[!] Note that this extremely counterintuitive situation arises most easily on
-machines with split caches, so that, for example, one cache bank processes
-even-numbered cache lines and the other bank processes odd-numbered cache
-lines. The pointer P might be stored in an odd-numbered cache line, and the
-variable B might be stored in an even-numbered cache line. Then, if the
-even-numbered bank of the reading CPU's cache is extremely busy while the
-odd-numbered bank is idle, one can see the new value of the pointer P (&B),
-but the old value of the variable B (2).
+can be used to record rare error conditions and the like, and the CPUs'
+naturally occurring ordering prevents such records from being lost.
The data dependency barrier is very important to the RCU system,
@@ -1876,8 +1883,7 @@ There are some more advanced barrier functions:
This makes sure that the death mark on the object is perceived to be set
*before* the reference counter is decremented.
- See Documentation/core-api/atomic_ops.rst for more information. See the
- "Atomic operations" subsection for information on where to use these.
+ See Documentation/atomic_{t,bitops}.txt for more information.
(*) lockless_dereference();
@@ -1982,10 +1988,7 @@ for each construct. These operations all imply certain barriers:
ACQUIRE operation has completed.
Memory operations issued before the ACQUIRE may be completed after
- the ACQUIRE operation has completed. An smp_mb__before_spinlock(),
- combined with a following ACQUIRE, orders prior stores against
- subsequent loads and stores. Note that this is weaker than smp_mb()!
- The smp_mb__before_spinlock() primitive is free on many architectures.
+ the ACQUIRE operation has completed.
(2) RELEASE operation implication:
@@ -2503,88 +2506,7 @@ operations are noted specially as some of them imply full memory barriers and
some don't, but they're very heavily relied on as a group throughout the
kernel.
-Any atomic operation that modifies some state in memory and returns information
-about the state (old or new) implies an SMP-conditional general memory barrier
-(smp_mb()) on each side of the actual operation (with the exception of
-explicit lock operations, described later). These include:
-
- xchg();
- atomic_xchg(); atomic_long_xchg();
- atomic_inc_return(); atomic_long_inc_return();
- atomic_dec_return(); atomic_long_dec_return();
- atomic_add_return(); atomic_long_add_return();
- atomic_sub_return(); atomic_long_sub_return();
- atomic_inc_and_test(); atomic_long_inc_and_test();
- atomic_dec_and_test(); atomic_long_dec_and_test();
- atomic_sub_and_test(); atomic_long_sub_and_test();
- atomic_add_negative(); atomic_long_add_negative();
- test_and_set_bit();
- test_and_clear_bit();
- test_and_change_bit();
-
- /* when succeeds */
- cmpxchg();
- atomic_cmpxchg(); atomic_long_cmpxchg();
- atomic_add_unless(); atomic_long_add_unless();
-
-These are used for such things as implementing ACQUIRE-class and RELEASE-class
-operations and adjusting reference counters towards object destruction, and as
-such the implicit memory barrier effects are necessary.
-
-
-The following operations are potential problems as they do _not_ imply memory
-barriers, but might be used for implementing such things as RELEASE-class
-operations:
-
- atomic_set();
- set_bit();
- clear_bit();
- change_bit();
-
-With these the appropriate explicit memory barrier should be used if necessary
-(smp_mb__before_atomic() for instance).
-
-
-The following also do _not_ imply memory barriers, and so may require explicit
-memory barriers under some circumstances (smp_mb__before_atomic() for
-instance):
-
- atomic_add();
- atomic_sub();
- atomic_inc();
- atomic_dec();
-
-If they're used for statistics generation, then they probably don't need memory
-barriers, unless there's a coupling between statistical data.
-
-If they're used for reference counting on an object to control its lifetime,
-they probably don't need memory barriers because either the reference count
-will be adjusted inside a locked section, or the caller will already hold
-sufficient references to make the lock, and thus a memory barrier unnecessary.
-
-If they're used for constructing a lock of some description, then they probably
-do need memory barriers as a lock primitive generally has to do things in a
-specific order.
-
-Basically, each usage case has to be carefully considered as to whether memory
-barriers are needed or not.
-
-The following operations are special locking primitives:
-
- test_and_set_bit_lock();
- clear_bit_unlock();
- __clear_bit_unlock();
-
-These implement ACQUIRE-class and RELEASE-class operations. These should be
-used in preference to other operations when implementing locking primitives,
-because their implementations can be optimised on many architectures.
-
-[!] Note that special memory barrier primitives are available for these
-situations because on some CPUs the atomic instructions used imply full memory
-barriers, and so barrier instructions are superfluous in conjunction with them,
-and in such cases the special barrier primitives will be no-ops.
-
-See Documentation/core-api/atomic_ops.rst for more information.
+See Documentation/atomic_t.txt for more information.
ACCESSING DEVICES
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
index c6beb5f1637f..7a79b3587dd3 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -30,8 +30,6 @@ atm.txt
- info on where to get ATM programs and support for Linux.
ax25.txt
- info on using AX.25 and NET/ROM code for Linux
-batman-adv.txt
- - B.A.T.M.A.N routing protocol on top of layer 2 Ethernet Frames.
baycom.txt
- info on the driver for Baycom style amateur radio modems
bonding.txt
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst
new file mode 100644
index 000000000000..a342b2cc3dc6
--- /dev/null
+++ b/Documentation/networking/batman-adv.rst
@@ -0,0 +1,220 @@
+==========
+batman-adv
+==========
+
+Batman advanced is a new approach to wireless networking which does no longer
+operate on the IP basis. Unlike the batman daemon, which exchanges information
+using UDP packets and sets routing tables, batman-advanced operates on ISO/OSI
+Layer 2 only and uses and routes (or better: bridges) Ethernet Frames. It
+emulates a virtual network switch of all nodes participating. Therefore all
+nodes appear to be link local, thus all higher operating protocols won't be
+affected by any changes within the network. You can run almost any protocol
+above batman advanced, prominent examples are: IPv4, IPv6, DHCP, IPX.
+
+Batman advanced was implemented as a Linux kernel driver to reduce the overhead
+to a minimum. It does not depend on any (other) network driver, and can be used
+on wifi as well as ethernet lan, vpn, etc ... (anything with ethernet-style
+layer 2).
+
+
+Configuration
+=============
+
+Load the batman-adv module into your kernel::
+
+ $ insmod batman-adv.ko
+
+The module is now waiting for activation. You must add some interfaces on which
+batman can operate. After loading the module batman advanced will scan your
+systems interfaces to search for compatible interfaces. Once found, it will
+create subfolders in the ``/sys`` directories of each supported interface,
+e.g.::
+
+ $ ls /sys/class/net/eth0/batman_adv/
+ elp_interval iface_status mesh_iface throughput_override
+
+If an interface does not have the ``batman_adv`` subfolder, it probably is not
+supported. Not supported interfaces are: loopback, non-ethernet and batman's
+own interfaces.
+
+Note: After the module was loaded it will continuously watch for new
+interfaces to verify the compatibility. There is no need to reload the module
+if you plug your USB wifi adapter into your machine after batman advanced was
+initially loaded.
+
+The batman-adv soft-interface can be created using the iproute2 tool ``ip``::
+
+ $ ip link add name bat0 type batadv
+
+To activate a given interface simply attach it to the ``bat0`` interface::
+
+ $ ip link set dev eth0 master bat0
+
+Repeat this step for all interfaces you wish to add. Now batman starts
+using/broadcasting on this/these interface(s).
+
+By reading the "iface_status" file you can check its status::
+
+ $ cat /sys/class/net/eth0/batman_adv/iface_status
+ active
+
+To deactivate an interface you have to detach it from the "bat0" interface::
+
+ $ ip link set dev eth0 nomaster
+
+
+All mesh wide settings can be found in batman's own interface folder::
+
+ $ ls /sys/class/net/bat0/mesh/
+ aggregated_ogms fragmentation isolation_mark routing_algo
+ ap_isolation gw_bandwidth log_level vlan0
+ bonding gw_mode multicast_mode
+ bridge_loop_avoidance gw_sel_class network_coding
+ distributed_arp_table hop_penalty orig_interval
+
+There is a special folder for debugging information::
+
+ $ ls /sys/kernel/debug/batman_adv/bat0/
+ bla_backbone_table log neighbors transtable_local
+ bla_claim_table mcast_flags originators
+ dat_cache nc socket
+ gateways nc_nodes transtable_global
+
+Some of the files contain all sort of status information regarding the mesh
+network. For example, you can view the table of originators (mesh
+participants) with::
+
+ $ cat /sys/kernel/debug/batman_adv/bat0/originators
+
+Other files allow to change batman's behaviour to better fit your requirements.
+For instance, you can check the current originator interval (value in
+milliseconds which determines how often batman sends its broadcast packets)::
+
+ $ cat /sys/class/net/bat0/mesh/orig_interval
+ 1000
+
+and also change its value::
+
+ $ echo 3000 > /sys/class/net/bat0/mesh/orig_interval
+
+In very mobile scenarios, you might want to adjust the originator interval to a
+lower value. This will make the mesh more responsive to topology changes, but
+will also increase the overhead.
+
+
+Usage
+=====
+
+To make use of your newly created mesh, batman advanced provides a new
+interface "bat0" which you should use from this point on. All interfaces added
+to batman advanced are not relevant any longer because batman handles them for
+you. Basically, one "hands over" the data by using the batman interface and
+batman will make sure it reaches its destination.
+
+The "bat0" interface can be used like any other regular interface. It needs an
+IP address which can be either statically configured or dynamically (by using
+DHCP or similar services)::
+
+ NodeA: ip link set up dev bat0
+ NodeA: ip addr add 192.168.0.1/24 dev bat0
+
+ NodeB: ip link set up dev bat0
+ NodeB: ip addr add 192.168.0.2/24 dev bat0
+ NodeB: ping 192.168.0.1
+
+Note: In order to avoid problems remove all IP addresses previously assigned to
+interfaces now used by batman advanced, e.g.::
+
+ $ ip addr flush dev eth0
+
+
+Logging/Debugging
+=================
+
+All error messages, warnings and information messages are sent to the kernel
+log. Depending on your operating system distribution this can be read in one of
+a number of ways. Try using the commands: ``dmesg``, ``logread``, or looking in
+the files ``/var/log/kern.log`` or ``/var/log/syslog``. All batman-adv messages
+are prefixed with "batman-adv:" So to see just these messages try::
+
+ $ dmesg | grep batman-adv
+
+When investigating problems with your mesh network, it is sometimes necessary to
+see more detail debug messages. This must be enabled when compiling the
+batman-adv module. When building batman-adv as part of kernel, use "make
+menuconfig" and enable the option ``B.A.T.M.A.N. debugging``
+(``CONFIG_BATMAN_ADV_DEBUG=y``).
+
+Those additional debug messages can be accessed using a special file in
+debugfs::
+
+ $ cat /sys/kernel/debug/batman_adv/bat0/log
+
+The additional debug output is by default disabled. It can be enabled during
+run time. Following log_levels are defined:
+
+.. flat-table::
+
+ * - 0
+ - All debug output disabled
+ * - 1
+ - Enable messages related to routing / flooding / broadcasting
+ * - 2
+ - Enable messages related to route added / changed / deleted
+ * - 4
+ - Enable messages related to translation table operations
+ * - 8
+ - Enable messages related to bridge loop avoidance
+ * - 16
+ - Enable messages related to DAT, ARP snooping and parsing
+ * - 32
+ - Enable messages related to network coding
+ * - 64
+ - Enable messages related to multicast
+ * - 128
+ - Enable messages related to throughput meter
+ * - 255
+ - Enable all messages
+
+The debug output can be changed at runtime using the file
+``/sys/class/net/bat0/mesh/log_level``. e.g.::
+
+ $ echo 6 > /sys/class/net/bat0/mesh/log_level
+
+will enable debug messages for when routes change.
+
+Counters for different types of packets entering and leaving the batman-adv
+module are available through ethtool::
+
+ $ ethtool --statistics bat0
+
+
+batctl
+======
+
+As batman advanced operates on layer 2, all hosts participating in the virtual
+switch are completely transparent for all protocols above layer 2. Therefore
+the common diagnosis tools do not work as expected. To overcome these problems,
+batctl was created. At the moment the batctl contains ping, traceroute, tcpdump
+and interfaces to the kernel module settings.
+
+For more information, please see the manpage (``man batctl``).
+
+batctl is available on https://www.open-mesh.org/
+
+
+Contact
+=======
+
+Please send us comments, experiences, questions, anything :)
+
+IRC:
+ #batman on irc.freenode.org
+Mailing-list:
+ b.a.t.m.a.n@open-mesh.org (optional subscription at
+ https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
+
+You can also contact the Authors:
+
+* Marek Lindner <mareklindner@neomailbox.ch>
+* Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt
deleted file mode 100644
index ccf94677b240..000000000000
--- a/Documentation/networking/batman-adv.txt
+++ /dev/null
@@ -1,215 +0,0 @@
-BATMAN-ADV
-----------
-
-Batman advanced is a new approach to wireless networking which
-does no longer operate on the IP basis. Unlike the batman daemon,
-which exchanges information using UDP packets and sets routing
-tables, batman-advanced operates on ISO/OSI Layer 2 only and uses
-and routes (or better: bridges) Ethernet Frames. It emulates a
-virtual network switch of all nodes participating. Therefore all
-nodes appear to be link local, thus all higher operating proto-
-cols won't be affected by any changes within the network. You can
-run almost any protocol above batman advanced, prominent examples
-are: IPv4, IPv6, DHCP, IPX.
-
-Batman advanced was implemented as a Linux kernel driver to re-
-duce the overhead to a minimum. It does not depend on any (other)
-network driver, and can be used on wifi as well as ethernet lan,
-vpn, etc ... (anything with ethernet-style layer 2).
-
-
-CONFIGURATION
--------------
-
-Load the batman-adv module into your kernel:
-
-# insmod batman-adv.ko
-
-The module is now waiting for activation. You must add some in-
-terfaces on which batman can operate. After loading the module
-batman advanced will scan your systems interfaces to search for
-compatible interfaces. Once found, it will create subfolders in
-the /sys directories of each supported interface, e.g.
-
-# ls /sys/class/net/eth0/batman_adv/
-# elp_interval iface_status mesh_iface throughput_override
-
-If an interface does not have the "batman_adv" subfolder it prob-
-ably is not supported. Not supported interfaces are: loopback,
-non-ethernet and batman's own interfaces.
-
-Note: After the module was loaded it will continuously watch for
-new interfaces to verify the compatibility. There is no need to
-reload the module if you plug your USB wifi adapter into your ma-
-chine after batman advanced was initially loaded.
-
-The batman-adv soft-interface can be created using the iproute2
-tool "ip"
-
-# ip link add name bat0 type batadv
-
-To activate a given interface simply attach it to the "bat0"
-interface
-
-# ip link set dev eth0 master bat0
-
-Repeat this step for all interfaces you wish to add. Now batman
-starts using/broadcasting on this/these interface(s).
-
-By reading the "iface_status" file you can check its status:
-
-# cat /sys/class/net/eth0/batman_adv/iface_status
-# active
-
-To deactivate an interface you have to detach it from the
-"bat0" interface:
-
-# ip link set dev eth0 nomaster
-
-
-All mesh wide settings can be found in batman's own interface
-folder:
-
-# ls /sys/class/net/bat0/mesh/
-# aggregated_ogms fragmentation isolation_mark routing_algo
-# ap_isolation gw_bandwidth log_level vlan0
-# bonding gw_mode multicast_mode
-# bridge_loop_avoidance gw_sel_class network_coding
-# distributed_arp_table hop_penalty orig_interval
-
-There is a special folder for debugging information:
-
-# ls /sys/kernel/debug/batman_adv/bat0/
-# bla_backbone_table log neighbors transtable_local
-# bla_claim_table mcast_flags originators
-# dat_cache nc socket
-# gateways nc_nodes transtable_global
-
-Some of the files contain all sort of status information regard-
-ing the mesh network. For example, you can view the table of
-originators (mesh participants) with:
-
-# cat /sys/kernel/debug/batman_adv/bat0/originators
-
-Other files allow to change batman's behaviour to better fit your
-requirements. For instance, you can check the current originator
-interval (value in milliseconds which determines how often batman
-sends its broadcast packets):
-
-# cat /sys/class/net/bat0/mesh/orig_interval
-# 1000
-
-and also change its value:
-
-# echo 3000 > /sys/class/net/bat0/mesh/orig_interval
-
-In very mobile scenarios, you might want to adjust the originator
-interval to a lower value. This will make the mesh more respon-
-sive to topology changes, but will also increase the overhead.
-
-
-USAGE
------
-
-To make use of your newly created mesh, batman advanced provides
-a new interface "bat0" which you should use from this point on.
-All interfaces added to batman advanced are not relevant any
-longer because batman handles them for you. Basically, one "hands
-over" the data by using the batman interface and batman will make
-sure it reaches its destination.
-
-The "bat0" interface can be used like any other regular inter-
-face. It needs an IP address which can be either statically con-
-figured or dynamically (by using DHCP or similar services):
-
-# NodeA: ip link set up dev bat0
-# NodeA: ip addr add 192.168.0.1/24 dev bat0
-
-# NodeB: ip link set up dev bat0
-# NodeB: ip addr add 192.168.0.2/24 dev bat0
-# NodeB: ping 192.168.0.1
-
-Note: In order to avoid problems remove all IP addresses previ-
-ously assigned to interfaces now used by batman advanced, e.g.
-
-# ip addr flush dev eth0
-
-
-LOGGING/DEBUGGING
------------------
-
-All error messages, warnings and information messages are sent to
-the kernel log. Depending on your operating system distribution
-this can be read in one of a number of ways. Try using the com-
-mands: dmesg, logread, or looking in the files /var/log/kern.log
-or /var/log/syslog. All batman-adv messages are prefixed with
-"batman-adv:" So to see just these messages try
-
-# dmesg | grep batman-adv
-
-When investigating problems with your mesh network it is some-
-times necessary to see more detail debug messages. This must be
-enabled when compiling the batman-adv module. When building bat-
-man-adv as part of kernel, use "make menuconfig" and enable the
-option "B.A.T.M.A.N. debugging".
-
-Those additional debug messages can be accessed using a special
-file in debugfs
-
-# cat /sys/kernel/debug/batman_adv/bat0/log
-
-The additional debug output is by default disabled. It can be en-
-abled during run time. Following log_levels are defined:
-
- 0 - All debug output disabled
- 1 - Enable messages related to routing / flooding / broadcasting
- 2 - Enable messages related to route added / changed / deleted
- 4 - Enable messages related to translation table operations
- 8 - Enable messages related to bridge loop avoidance
- 16 - Enable messages related to DAT, ARP snooping and parsing
- 32 - Enable messages related to network coding
- 64 - Enable messages related to multicast
-128 - Enable messages related to throughput meter
-255 - Enable all messages
-
-The debug output can be changed at runtime using the file
-/sys/class/net/bat0/mesh/log_level. e.g.
-
-# echo 6 > /sys/class/net/bat0/mesh/log_level
-
-will enable debug messages for when routes change.
-
-Counters for different types of packets entering and leaving the
-batman-adv module are available through ethtool:
-
-# ethtool --statistics bat0
-
-
-BATCTL
-------
-
-As batman advanced operates on layer 2 all hosts participating in
-the virtual switch are completely transparent for all protocols
-above layer 2. Therefore the common diagnosis tools do not work
-as expected. To overcome these problems batctl was created. At
-the moment the batctl contains ping, traceroute, tcpdump and
-interfaces to the kernel module settings.
-
-For more information, please see the manpage (man batctl).
-
-batctl is available on https://www.open-mesh.org/
-
-
-CONTACT
--------
-
-Please send us comments, experiences, questions, anything :)
-
-IRC: #batman on irc.freenode.org
-Mailing-list: b.a.t.m.a.n@open-mesh.org (optional subscription
- at https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n)
-
-You can also contact the Authors:
-
-Marek Lindner <mareklindner@neomailbox.ch>
-Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/dpaa.txt b/Documentation/networking/dpaa.txt
index 76e016d4d344..f88194f71c54 100644
--- a/Documentation/networking/dpaa.txt
+++ b/Documentation/networking/dpaa.txt
@@ -13,6 +13,7 @@ Contents
- Configuring DPAA Ethernet in your kernel
- DPAA Ethernet Frame Processing
- DPAA Ethernet Features
+ - DPAA IRQ Affinity and Receive Side Scaling
- Debugging
DPAA Ethernet Overview
@@ -147,7 +148,10 @@ gradually.
The driver has Rx and Tx checksum offloading for UDP and TCP. Currently the Rx
checksum offload feature is enabled by default and cannot be controlled through
-ethtool.
+ethtool. Also, rx-flow-hash and rx-hashing was added. The addition of RSS
+provides a big performance boost for the forwarding scenarios, allowing
+different traffic flows received by one interface to be processed by different
+CPUs in parallel.
The driver has support for multiple prioritized Tx traffic classes. Priorities
range from 0 (lowest) to 3 (highest). These are mapped to HW workqueues with
@@ -166,6 +170,68 @@ classes as follows:
tc qdisc add dev <int> root handle 1: \
mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1
+DPAA IRQ Affinity and Receive Side Scaling
+==========================================
+
+Traffic coming on the DPAA Rx queues or on the DPAA Tx confirmation
+queues is seen by the CPU as ingress traffic on a certain portal.
+The DPAA QMan portal interrupts are affined each to a certain CPU.
+The same portal interrupt services all the QMan portal consumers.
+
+By default the DPAA Ethernet driver enables RSS, making use of the
+DPAA FMan Parser and Keygen blocks to distribute traffic on 128
+hardware frame queues using a hash on IP v4/v6 source and destination
+and L4 source and destination ports, in present in the received frame.
+When RSS is disabled, all traffic received by a certain interface is
+received on the default Rx frame queue. The default DPAA Rx frame
+queues are configured to put the received traffic into a pool channel
+that allows any available CPU portal to dequeue the ingress traffic.
+The default frame queues have the HOLDACTIVE option set, ensuring that
+traffic bursts from a certain queue are serviced by the same CPU.
+This ensures a very low rate of frame reordering. A drawback of this
+is that only one CPU at a time can service the traffic received by a
+certain interface when RSS is not enabled.
+
+To implement RSS, the DPAA Ethernet driver allocates an extra set of
+128 Rx frame queues that are configured to dedicated channels, in a
+round-robin manner. The mapping of the frame queues to CPUs is now
+hardcoded, there is no indirection table to move traffic for a certain
+FQ (hash result) to another CPU. The ingress traffic arriving on one
+of these frame queues will arrive at the same portal and will always
+be processed by the same CPU. This ensures intra-flow order preservation
+and workload distribution for multiple traffic flows.
+
+RSS can be turned off for a certain interface using ethtool, i.e.
+
+ # ethtool -N fm1-mac9 rx-flow-hash tcp4 ""
+
+To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6:
+
+ # ethtool -N fm1-mac9 rx-flow-hash udp4 sfdn
+
+There is no independent control for individual protocols, any command
+run for one of tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 is
+going to control the rx-flow-hashing for all protocols on that interface.
+
+Besides using the FMan Keygen computed hash for spreading traffic on the
+128 Rx FQs, the DPAA Ethernet driver also sets the skb hash value when
+the NETIF_F_RXHASH feature is on (active by default). This can be turned
+on or off through ethtool, i.e.:
+
+ # ethtool -K fm1-mac9 rx-hashing off
+ # ethtool -k fm1-mac9 | grep hash
+ receive-hashing: off
+ # ethtool -K fm1-mac9 rx-hashing on
+ Actual changes:
+ receive-hashing: on
+ # ethtool -k fm1-mac9 | grep hash
+ receive-hashing: on
+
+Please note that Rx hashing depends upon the rx-flow-hashing being on
+for that interface - turning off rx-flow-hashing will also disable the
+rx-hashing (without ethtool reporting it as off as that depends on the
+NETIF_F_RXHASH feature flag).
+
Debugging
=========
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index b69b205501de..e5e33bac2068 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply
before a conversion to the new layout is being done behind the scenes!
Currently, the classic BPF format is being used for JITing on most 32-bit
-architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT
-compilation from eBPF instruction set.
+architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64, arm32 perform
+JIT compilation from eBPF instruction set.
Some core changes of the new internal format:
@@ -793,7 +793,7 @@ Some core changes of the new internal format:
bpf_exit
After the call the registers R1-R5 contain junk values and cannot be read.
- In the future an eBPF verifier can be used to validate internal BPF programs.
+ An in-kernel eBPF verifier is used to validate internal BPF programs.
Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
@@ -906,6 +906,10 @@ If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
BPF_JSGE 0x70 /* eBPF only: signed '>=' */
BPF_CALL 0x80 /* eBPF only: function call */
BPF_EXIT 0x90 /* eBPF only: function return */
+ BPF_JLT 0xa0 /* eBPF only: unsigned '<' */
+ BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */
+ BPF_JSLT 0xc0 /* eBPF only: signed '<' */
+ BPF_JSLE 0xd0 /* eBPF only: signed '<=' */
So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
@@ -1017,7 +1021,7 @@ At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If verifier sees an insn that does R2=R1, then R2 has now type
PTR_TO_CTX as well and can be used on the right hand side of expression.
-If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE,
+If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes invalid pointer.
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
sure that kernel addresses don't leak to unprivileged users)
@@ -1039,7 +1043,7 @@ is a correct program. If there was R1 instead of R6, it would have
been rejected.
load/store instructions are allowed only with registers of valid types, which
-are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked.
+are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example:
bpf_mov R1 = 1
bpf_mov R2 = 2
@@ -1058,7 +1062,7 @@ intends to load a word from address R6 + 8 and store it into R0
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
-If R6=FRAME_PTR, then access should be aligned and be within
+If R6=PTR_TO_STACK, then access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
so it will fail verification, since it's out of bounds.
@@ -1069,7 +1073,7 @@ For example:
bpf_ld R0 = *(u32 *)(R10 - 4)
bpf_exit
is invalid program.
-Though R10 is correct read-only register and has type FRAME_PTR
+Though R10 is correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.
Pointer register spill/fill is tracked as well, since four (R6-R9)
@@ -1094,6 +1098,71 @@ all use cases.
See details of eBPF verifier in kernel/bpf/verifier.c
+Register value tracking
+-----------------------
+In order to determine the safety of an eBPF program, the verifier must track
+the range of possible values in each register and also in each stack slot.
+This is done with 'struct bpf_reg_state', defined in include/linux/
+bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
+register state has a type, which is either NOT_INIT (the register has not been
+written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
+pointer type. The types of pointers describe their base, as follows:
+ PTR_TO_CTX Pointer to bpf_context.
+ CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic
+ on these pointers is forbidden.
+ PTR_TO_MAP_VALUE Pointer to the value stored in a map element.
+ PTR_TO_MAP_VALUE_OR_NULL
+ Either a pointer to a map value, or NULL; map accesses
+ (see section 'eBPF maps', below) return this type,
+ which becomes a PTR_TO_MAP_VALUE when checked != NULL.
+ Arithmetic on these pointers is forbidden.
+ PTR_TO_STACK Frame pointer.
+ PTR_TO_PACKET skb->data.
+ PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden.
+However, a pointer may be offset from this base (as a result of pointer
+arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
+offset'. The former is used when an exactly-known value (e.g. an immediate
+operand) is added to a pointer, while the latter is used for values which are
+not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
+the range of possible values in the register.
+The verifier's knowledge about the variable offset consists of:
+* minimum and maximum values as unsigned
+* minimum and maximum values as signed
+* knowledge of the values of individual bits, in the form of a 'tnum': a u64
+'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
+1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
+mask and value; no bit should ever be 1 in both. For example, if a byte is read
+into a register from memory, the register's top 56 bits are known zero, while
+the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
+then OR this with 0x40, we get (0x40; 0xcf), then if we add 1 we get (0x0;
+0x1ff), because of potential carries.
+Besides arithmetic, the register state can also be updated by conditional
+branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
+it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
+branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
+BPF_JSGE) would instead update the signed minimum/maximum values. Information
+from the signed and unsigned bounds can be combined; for instance if a value is
+first tested < 8 and then tested s> 4, the verifier will conclude that the value
+is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
+PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
+pointers sharing that same variable offset. This is important for packet range
+checks: after adding some variable to a packet pointer, if you then copy it to
+another register and (say) add a constant 4, both registers will share the same
+'id' but one will have a fixed offset of +4. Then if it is bounds-checked and
+found to be less than a PTR_TO_PACKET_END, the other register is now known to
+have a safe range of at least 4 bytes. See 'Direct packet access', below, for
+more on PTR_TO_PACKET ranges.
+The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
+the pointer returned from a map lookup. This means that when one copy is
+checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.
+As well as range-checking, the tracked information is also used for enforcing
+alignment of pointer accesses. For instance, on most systems the packet pointer
+is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
+over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting
+pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
+bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
+that pointer are safe.
+
Direct packet access
--------------------
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
@@ -1121,7 +1190,7 @@ it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14)
which is zero bytes.
More complex packet access may look like:
- R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
+ R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
7: r4 = *(u8 *)(r3 +12)
8: r4 *= 14
@@ -1135,26 +1204,31 @@ More complex packet access may look like:
16: r2 += 8
17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
18: if r2 > r1 goto pc+2
- R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp
+ R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
19: r1 = *(u8 *)(r3 +4)
The state of the register R3 is R3=pkt(id=2,off=0,r=8)
id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some
offset within a packet and since the program author did
'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8).
-The verifier only allows 'add' operation on packet registers. Any other
-operation will set the register state to 'unknown_value' and it won't be
+The verifier only allows 'add'/'sub' operations on packet registers. Any other
+operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.
Operation 'r3 += rX' may overflow and become less than original skb->data,
-therefore the verifier has to prevent that. So it tracks the number of
-upper zero bits in all 'uknown_value' registers, so when it sees
-'r3 += rX' instruction and rX is more than 16-bit value, it will error as:
-"cannot add integer value with N upper zero bits to ptr_to_packet"
+therefore the verifier has to prevent that. So when it sees 'r3 += rX'
+instruction and rX is more than 16-bit value, any subsequent bounds-check of r3
+against skb->data_end will not give us 'range' information, so attempts to read
+through the pointer will give "invalid access to packet" error.
Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is
-R4=inv56 which means that upper 56 bits on the register are guaranteed
-to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since
-multiplying 8-bit value by constant 14 will keep upper 52 bits as zero.
-Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign
-extending. This logic is implemented in evaluate_reg_alu() function.
+R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits
+of the register are guaranteed to be zero, and nothing is known about the lower
+8 bits. After insn 'r4 *= 14' the state becomes
+R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
+value by constant 14 will keep upper 52 bits as zero, also the least significant
+bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make
+R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
+extending. This logic is implemented in adjust_reg_min_max_vals() function,
+which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice
+versa) and adjust_scalar_min_max_vals() for operations on two scalars.
The end result is that bpf program author can access packet directly
using normal C code as:
@@ -1214,6 +1288,22 @@ The map is defined by:
. key size in bytes
. value size in bytes
+Pruning
+-------
+The verifier does not actually walk all possible paths through the program. For
+each new branch to analyse, the verifier looks at all the states it's previously
+been in when at this instruction. If any of them contain the current state as a
+subset, the branch is 'pruned' - that is, the fact that the previous state was
+accepted implies the current state would be as well. For instance, if in the
+previous state, r1 held a packet-pointer, and in the current state, r1 holds a
+packet-pointer with a range as long or longer and at least as strict an
+alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
+have been used by any path from that point, so any value in r2 (including
+another NOT_INIT) is safe. The implementation is in the function regsafe().
+Pruning considers not only the registers but also the stack (and any spilled
+registers it may hold). They must all be safe for the branch to be pruned.
+This is implemented in states_equal().
+
Understanding eBPF verifier messages
------------------------------------
diff --git a/Documentation/networking/hinic.txt b/Documentation/networking/hinic.txt
new file mode 100644
index 000000000000..989366a4039c
--- /dev/null
+++ b/Documentation/networking/hinic.txt
@@ -0,0 +1,125 @@
+Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family
+============================================================
+
+Overview:
+=========
+HiNIC is a network interface card for the Data Center Area.
+
+The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.).
+The driver supports also a negotiated and extendable feature set.
+
+Some HiNIC devices support SR-IOV. This driver is used for Physical Function
+(PF).
+
+HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and
+adaptive interrupt moderation.
+
+HiNIC devices support also various offload features such as checksum offload,
+TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and
+LRO(Large Receive Offload).
+
+
+Supported PCI vendor ID/device IDs:
+===================================
+
+19e5:1822 - HiNIC PF
+
+
+Driver Architecture and Source Code:
+====================================
+
+hinic_dev - Implement a Logical Network device that is independent from
+specific HW details about HW data structure formats.
+
+hinic_hwdev - Implement the HW details of the device and include the components
+for accessing the PCI NIC.
+
+hinic_hwdev contains the following components:
+===============================================
+
+HW Interface:
+=============
+
+The interface for accessing the pci device (DMA memory and PCI BARs).
+(hinic_hw_if.c, hinic_hw_if.h)
+
+Configuration Status Registers Area that describes the HW Registers on the
+configuration and status BAR0. (hinic_hw_csr.h)
+
+MGMT components:
+================
+
+Asynchronous Event Queues(AEQs) - The event queues for receiving messages from
+the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h)
+
+Application Programmable Interface commands(API CMD) - Interface for sending
+MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h)
+
+Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT
+commands to the card and receives notifications from the MGMT modules on the
+card by AEQs. Also set the addresses of the IO CMDQs in HW.
+(hinic_hw_mgmt.c, hinic_hw_mgmt.h)
+
+IO components:
+==============
+
+Completion Event Queues(CEQs) - The completion Event Queues that describe IO
+tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h)
+
+Work Queues(WQ) - Contain the memory and operations for use by CMD queues and
+the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains
+pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs).
+(hinic_hw_wq.c, hinic_hw_wq.h)
+
+Command Queues(CMDQ) - The queues for sending commands for IO management and is
+used to set the QPs addresses in HW. The commands completion events are
+accumulated on the CEQ that is configured to receive the CMDQ completion events.
+(hinic_hw_cmdq.c, hinic_hw_cmdq.h)
+
+Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting
+Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h)
+
+IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h)
+
+HW device:
+==========
+
+HW device - de/constructs the HW Interface, the MGMT components on the
+initialization of the driver and the IO components on the case of Interface
+UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h)
+
+
+hinic_dev contains the following components:
+===============================================
+
+PCI ID table - Contains the supported PCI Vendor/Device IDs.
+(hinic_pci_tbl.h)
+
+Port Commands - Send commands to the HW device for port management
+(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h)
+
+Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit.
+The Logical Tx queue is not dependent on the format of the HW Send Queue.
+(hinic_tx.c, hinic_tx.h)
+
+Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive.
+The Logical Rx queue is not dependent on the format of the HW Receive Queue.
+(hinic_rx.c, hinic_rx.h)
+
+hinic_dev - de/constructs the Logical Tx and Rx Queues.
+(hinic_main.c, hinic_dev.h)
+
+
+Miscellaneous:
+=============
+
+Common functions that are used by HW and Logical Device.
+(hinic_common.c, hinic_common.h)
+
+
+Support
+=======
+
+If an issue is identified with the released source code on the supported kernel
+with a supported adapter, email the specific information related to the issue to
+aviad.krawczyk@huawei.com.
diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt
index c4114346f054..057e9fdbfac9 100644
--- a/Documentation/networking/ieee802154.txt
+++ b/Documentation/networking/ieee802154.txt
@@ -84,17 +84,17 @@ Device drivers API
==================
The include/net/mac802154.h defines following functions:
- - struct ieee802154_dev *ieee802154_alloc_device
- (size_t priv_size, struct ieee802154_ops *ops):
- allocation of IEEE 802.15.4 compatible device
+ - struct ieee802154_hw *
+ ieee802154_alloc_hw(size_t priv_data_len, const struct ieee802154_ops *ops):
+ allocation of IEEE 802.15.4 compatible hardware device
- - void ieee802154_free_device(struct ieee802154_dev *dev):
- freeing allocated device
+ - void ieee802154_free_hw(struct ieee802154_hw *hw):
+ freeing allocated hardware device
- - int ieee802154_register_device(struct ieee802154_dev *dev):
- register PHY in the system
+ - int ieee802154_register_hw(struct ieee802154_hw *hw):
+ register PHY which is the allocated hardware device, in the system
- - void ieee802154_unregister_device(struct ieee802154_dev *dev):
+ - void ieee802154_unregister_hw(struct ieee802154_hw *hw):
freeing registered PHY
Moreover IEEE 802.15.4 device operations structure should be filled.
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index b5bd87e01f52..66e620866245 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -6,6 +6,7 @@ Contents:
.. toctree::
:maxdepth: 2
+ batman-adv
kapi
z8530book
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 974ab47ae53a..b3345d0fe0a6 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -109,7 +109,10 @@ neigh/default/unres_qlen_bytes - INTEGER
queued for each unresolved address by other network layers.
(added in linux 3.3)
Setting negative value is meaningless and will return error.
- Default: 65536 Bytes(64KB)
+ Default: SK_WMEM_MAX, (same as net.core.wmem_default).
+ Exact value depends on architecture and kernel options,
+ but should be enough to allow queuing 256 packets
+ of medium size.
neigh/default/unres_qlen - INTEGER
The maximum number of packets which may be queued for each
@@ -119,7 +122,7 @@ neigh/default/unres_qlen - INTEGER
unexpected packet loss. The current default value is calculated
according to default value of unres_qlen_bytes and true size of
packet.
- Default: 31
+ Default: 101
mtu_expires - INTEGER
Time, in seconds, that cached PMTU information is kept.
@@ -353,12 +356,7 @@ tcp_l3mdev_accept - BOOLEAN
compiled with CONFIG_NET_L3_MASTER_DEV.
tcp_low_latency - BOOLEAN
- If set, the TCP stack makes decisions that prefer lower
- latency as opposed to higher throughput. By default, this
- option is not set meaning that higher throughput is preferred.
- An example of an application where this default should be
- changed would be a Beowulf compute cluster.
- Default: 0
+ This is a legacy option, it has no effect anymore.
tcp_max_orphans - INTEGER
Maximal number of TCP sockets not attached to any user file handle,
@@ -1291,8 +1289,7 @@ tag - INTEGER
xfrm4_gc_thresh - INTEGER
The threshold at which we will start garbage collecting for IPv4
destination cache entries. At twice this value the system will
- refuse new allocations. The value must be set below the flowcache
- limit (4096 * number of online cpus) to take effect.
+ refuse new allocations.
igmp_link_local_mcast_reports - BOOLEAN
Enable IGMP reports for link local multicast groups in the
@@ -1356,6 +1353,15 @@ flowlabel_state_ranges - BOOLEAN
FALSE: disabled
Default: true
+flowlabel_reflect - BOOLEAN
+ Automatically reflect the flow label. Needed for Path MTU
+ Discovery to work with Equal Cost Multipath Routing in anycast
+ environments. See RFC 7690 and:
+ https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01
+ TRUE: enabled
+ FALSE: disabled
+ Default: FALSE
+
anycast_src_echo_reply - BOOLEAN
Controls the use of anycast addresses as source addresses for ICMPv6
echo reply
@@ -1778,8 +1784,7 @@ ratelimit - INTEGER
xfrm6_gc_thresh - INTEGER
The threshold at which we will start garbage collecting for IPv6
destination cache entries. At twice this value the system will
- refuse new allocations. The value must be set below the flowcache
- limit (4096 * number of online cpus) to take effect.
+ refuse new allocations.
IPv6 Update by:
diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst
new file mode 100644
index 000000000000..77f6d7e25cfd
--- /dev/null
+++ b/Documentation/networking/msg_zerocopy.rst
@@ -0,0 +1,257 @@
+
+============
+MSG_ZEROCOPY
+============
+
+Intro
+=====
+
+The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
+The feature is currently implemented for TCP sockets.
+
+
+Opportunity and Caveats
+-----------------------
+
+Copying large buffers between user process and kernel can be
+expensive. Linux supports various interfaces that eschew copying,
+such as sendpage and splice. The MSG_ZEROCOPY flag extends the
+underlying copy avoidance mechanism to common socket send calls.
+
+Copy avoidance is not a free lunch. As implemented, with page pinning,
+it replaces per byte copy cost with page accounting and completion
+notification overhead. As a result, MSG_ZEROCOPY is generally only
+effective at writes over around 10 KB.
+
+Page pinning also changes system call semantics. It temporarily shares
+the buffer between process and network stack. Unlike with copying, the
+process cannot immediately overwrite the buffer after system call
+return without possibly modifying the data in flight. Kernel integrity
+is not affected, but a buggy program can possibly corrupt its own data
+stream.
+
+The kernel returns a notification when it is safe to modify data.
+Converting an existing application to MSG_ZEROCOPY is not always as
+trivial as just passing the flag, then.
+
+
+More Info
+---------
+
+Much of this document was derived from a longer paper presented at
+netdev 2.1. For more in-depth information see that paper and talk,
+the excellent reporting over at LWN.net or read the original code.
+
+ paper, slides, video
+ https://netdevconf.org/2.1/session.html?debruijn
+
+ LWN article
+ https://lwn.net/Articles/726917/
+
+ patchset
+ [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY
+ http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com
+
+
+Interface
+=========
+
+Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy
+avoidance, but not the only one.
+
+Socket Setup
+------------
+
+The kernel is permissive when applications pass undefined flags to the
+send system call. By default it simply ignores these. To avoid enabling
+copy avoidance mode for legacy processes that accidentally already pass
+this flag, a process must first signal intent by setting a socket option:
+
+::
+
+ if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
+ error(1, errno, "setsockopt zerocopy");
+
+
+Transmission
+------------
+
+The change to send (or sendto, sendmsg, sendmmsg) itself is trivial.
+Pass the new flag.
+
+::
+
+ ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY);
+
+A zerocopy failure will return -1 with errno ENOBUFS. This happens if
+the socket option was not set, the socket exceeds its optmem limit or
+the user exceeds its ulimit on locked pages.
+
+
+Mixing copy avoidance and copying
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Many workloads have a mixture of large and small buffers. Because copy
+avoidance is more expensive than copying for small packets, the
+feature is implemented as a flag. It is safe to mix calls with the flag
+with those without.
+
+
+Notifications
+-------------
+
+The kernel has to notify the process when it is safe to reuse a
+previously passed buffer. It queues completion notifications on the
+socket error queue, akin to the transmit timestamping interface.
+
+The notification itself is a simple scalar value. Each socket
+maintains an internal unsigned 32-bit counter. Each send call with
+MSG_ZEROCOPY that successfully sends data increments the counter. The
+counter is not incremented on failure or if called with length zero.
+The counter counts system call invocations, not bytes. It wraps after
+UINT_MAX calls.
+
+
+Notification Reception
+~~~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates the API. In the simplest case, each
+send syscall is followed by a poll and recvmsg on the error queue.
+
+Reading from the error queue is always a non-blocking operation. The
+poll call is there to block until an error is outstanding. It will set
+POLLERR in its output flags. That flag does not have to be set in the
+events field. Errors are signaled unconditionally.
+
+::
+
+ pfd.fd = fd;
+ pfd.events = 0;
+ if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0)
+ error(1, errno, "poll");
+
+ ret = recvmsg(fd, &msg, MSG_ERRQUEUE);
+ if (ret == -1)
+ error(1, errno, "recvmsg");
+
+ read_notification(msg);
+
+The example is for demonstration purpose only. In practice, it is more
+efficient to not wait for notifications, but read without blocking
+every couple of send calls.
+
+Notifications can be processed out of order with other operations on
+the socket. A socket that has an error queued would normally block
+other operations until the error is read. Zerocopy notifications have
+a zero error code, however, to not block send and recv calls.
+
+
+Notification Batching
+~~~~~~~~~~~~~~~~~~~~~
+
+Multiple outstanding packets can be read at once using the recvmmsg
+call. This is often not needed. In each message the kernel returns not
+a single value, but a range. It coalesces consecutive notifications
+while one is outstanding for reception on the error queue.
+
+When a new notification is about to be queued, it checks whether the
+new value extends the range of the notification at the tail of the
+queue. If so, it drops the new notification packet and instead increases
+the range upper value of the outstanding notification.
+
+For protocols that acknowledge data in-order, like TCP, each
+notification can be squashed into the previous one, so that no more
+than one notification is outstanding at any one point.
+
+Ordered delivery is the common case, but not guaranteed. Notifications
+may arrive out of order on retransmission and socket teardown.
+
+
+Notification Parsing
+~~~~~~~~~~~~~~~~~~~~
+
+The below snippet demonstrates how to parse the control message: the
+read_notification() call in the previous snippet. A notification
+is encoded in the standard error format, sock_extended_err.
+
+The level and type fields in the control data are protocol family
+specific, IP_RECVERR or IPV6_RECVERR.
+
+Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero,
+as explained before, to avoid blocking read and write system calls on
+the socket.
+
+The 32-bit notification range is encoded as [ee_info, ee_data]. This
+range is inclusive. Other fields in the struct must be treated as
+undefined, bar for ee_code, as discussed below.
+
+::
+
+ struct sock_extended_err *serr;
+ struct cmsghdr *cm;
+
+ cm = CMSG_FIRSTHDR(msg);
+ if (cm->cmsg_level != SOL_IP &&
+ cm->cmsg_type != IP_RECVERR)
+ error(1, 0, "cmsg");
+
+ serr = (void *) CMSG_DATA(cm);
+ if (serr->ee_errno != 0 ||
+ serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
+ error(1, 0, "serr");
+
+ printf("completed: %u..%u\n", serr->ee_info, serr->ee_data);
+
+
+Deferred copies
+~~~~~~~~~~~~~~~
+
+Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy
+avoidance, and a contract that the kernel will queue a completion
+notification. It is not a guarantee that the copy is elided.
+
+Copy avoidance is not always feasible. Devices that do not support
+scatter-gather I/O cannot send packets made up of kernel generated
+protocol headers plus zerocopy user data. A packet may need to be
+converted to a private copy of data deep in the stack, say to compute
+a checksum.
+
+In all these cases, the kernel returns a completion notification when
+it releases its hold on the shared pages. That notification may arrive
+before the (copied) data is fully transmitted. A zerocopy completion
+notification is not a transmit completion notification, therefore.
+
+Deferred copies can be more expensive than a copy immediately in the
+system call, if the data is no longer warm in the cache. The process
+also incurs notification processing cost for no benefit. For this
+reason, the kernel signals if data was completed with a copy, by
+setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return.
+A process may use this signal to stop passing flag MSG_ZEROCOPY on
+subsequent requests on the same socket.
+
+
+Implementation
+==============
+
+Loopback
+--------
+
+Data sent to local sockets can be queued indefinitely if the receive
+process does not read its socket. Unbound notification latency is not
+acceptable. For this reason all packets generated with MSG_ZEROCOPY
+that are looped to a local socket will incur a deferred copy. This
+includes looping onto packet sockets (e.g., tcpdump) and tun devices.
+
+
+Testing
+=======
+
+More realistic example code can be found in the kernel source under
+tools/testing/selftests/net/msg_zerocopy.c.
+
+Be cognizant of the loopback constraint. The test can be run between
+a pair of hosts. But if run between a local pair of processes, for
+instance when run with msg_zerocopy.sh between a veth pair across
+namespaces, the test will not show any improvement. For testing, the
+loopback restriction can be temporarily relaxed by making
+skb_orphan_frags_rx identical to skb_orphan_frags.
diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt
index 247a30ba8e17..cfc66ea72329 100644
--- a/Documentation/networking/netdev-FAQ.txt
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -111,6 +111,14 @@ A: Generally speaking, the patches get triaged quickly (in less than 48h).
patch is a good way to ensure your patch is ignored or pushed to
the bottom of the priority list.
+Q: I submitted multiple versions of the patch series, should I directly update
+ patchwork for the previous versions of these patch series?
+
+A: No, please don't interfere with the patch status on patchwork, leave it to
+ the maintainer to figure out what is the most recent and current version that
+ should be applied. If there is any doubt, the maintainer will reply and ask
+ what should be done.
+
Q: How can I tell what patches are queued up for backporting to the
various stable releases?
diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt
new file mode 100644
index 000000000000..93560fb1170a
--- /dev/null
+++ b/Documentation/networking/netvsc.txt
@@ -0,0 +1,75 @@
+Hyper-V network driver
+======================
+
+Compatibility
+=============
+
+This driver is compatible with Windows Server 2012 R2, 2016 and
+Windows 10.
+
+Features
+========
+
+ Checksum offload
+ ----------------
+ The netvsc driver supports checksum offload as long as the
+ Hyper-V host version does. Windows Server 2016 and Azure
+ support checksum offload for TCP and UDP for both IPv4 and
+ IPv6. Windows Server 2012 only supports checksum offload for TCP.
+
+ Receive Side Scaling
+ --------------------
+ Hyper-V supports receive side scaling. For TCP, packets are
+ distributed among available queues based on IP address and port
+ number.
+
+ For UDP, we can switch UDP hash level between L3 and L4 by ethtool
+ command. UDP over IPv4 and v6 can be set differently. The default
+ hash level is L4. We currently only allow switching TX hash level
+ from within the guests.
+
+ On Azure, fragmented UDP packets have high loss rate with L4
+ hashing. Using L3 hashing is recommended in this case.
+
+ For example, for UDP over IPv4 on eth0:
+ To include UDP port numbers in hashing:
+ ethtool -N eth0 rx-flow-hash udp4 sdfn
+ To exclude UDP port numbers in hashing:
+ ethtool -N eth0 rx-flow-hash udp4 sd
+ To show UDP hash level:
+ ethtool -n eth0 rx-flow-hash udp4
+
+ Generic Receive Offload, aka GRO
+ --------------------------------
+ The driver supports GRO and it is enabled by default. GRO coalesces
+ like packets and significantly reduces CPU usage under heavy Rx
+ load.
+
+ SR-IOV support
+ --------------
+ Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV
+ is enabled in both the vSwitch and the guest configuration, then the
+ Virtual Function (VF) device is passed to the guest as a PCI
+ device. In this case, both a synthetic (netvsc) and VF device are
+ visible in the guest OS and both NIC's have the same MAC address.
+
+ The VF is enslaved by netvsc device. The netvsc driver will transparently
+ switch the data path to the VF when it is available and up.
+ Network state (addresses, firewall, etc) should be applied only to the
+ netvsc device; the slave device should not be accessed directly in
+ most cases. The exceptions are if some special queue discipline or
+ flow direction is desired, these should be applied directly to the
+ VF slave device.
+
+ Receive Buffer
+ --------------
+ Packets are received into a receive area which is created when device
+ is probed. The receive area is broken into MTU sized chunks and each may
+ contain one or more packets. The number of receive sections may be changed
+ via ethtool Rx ring parameters.
+
+ There is a similar send buffer which is used to aggregate packets for sending.
+ The send area is broken into chunks of 6144 bytes, each of section may
+ contain one or more packets. The send buffer is an optimization, the driver
+ will use slower method to handle very large packets or if the send buffer
+ area is exhausted.
diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt
index 497d668288f9..433b6724797a 100644
--- a/Documentation/networking/nf_conntrack-sysctl.txt
+++ b/Documentation/networking/nf_conntrack-sysctl.txt
@@ -96,17 +96,6 @@ nf_conntrack_max - INTEGER
Size of connection tracking table. Default value is
nf_conntrack_buckets value * 4.
-nf_conntrack_default_on - BOOLEAN
- 0 - don't register conntrack in new net namespaces
- 1 - register conntrack in new net namespaces (default)
-
- This controls wheter newly created network namespaces have connection
- tracking enabled by default. It will be enabled automatically
- regardless of this setting if the new net namespace requires
- connection tracking, e.g. when NAT rules are created.
- This setting is only visible in initial user namespace, it has no
- effect on existing namespaces.
-
nf_conntrack_tcp_be_liberal - BOOLEAN
0 - disabled (default)
not 0 - enabled
diff --git a/Documentation/networking/rmnet.txt b/Documentation/networking/rmnet.txt
new file mode 100644
index 000000000000..6b341eaf2062
--- /dev/null
+++ b/Documentation/networking/rmnet.txt
@@ -0,0 +1,82 @@
+1. Introduction
+
+rmnet driver is used for supporting the Multiplexing and aggregation
+Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm
+Technologies, Inc. modems.
+
+This driver can be used to register onto any physical network device in
+IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator.
+
+Multiplexing allows for creation of logical netdevices (rmnet devices) to
+handle multiple private data networks (PDN) like a default internet, tethering,
+multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends
+packets with MAP headers to rmnet. Based on the multiplexer id, rmnet
+routes to the appropriate PDN after removing the MAP header.
+
+Aggregation is required to achieve high data rates. This involves hardware
+sending aggregated bunch of MAP frames. rmnet driver will de-aggregate
+these MAP frames and send them to appropriate PDN's.
+
+2. Packet format
+
+a. MAP packet (data / control)
+
+MAP header has the same endianness of the IP packet.
+
+Packet format -
+
+Bit 0 1 2-7 8 - 15 16 - 31
+Function Command / Data Reserved Pad Multiplexer ID Payload length
+Bit 32 - x
+Function Raw Bytes
+
+Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command
+or data packet. Control packet is used for transport level flow control. Data
+packets are standard IP packets.
+
+Reserved bits are usually zeroed out and to be ignored by receiver.
+
+Padding is number of bytes to be added for 4 byte alignment if required by
+hardware.
+
+Multiplexer ID is to indicate the PDN on which data has to be sent.
+
+Payload length includes the padding length but does not include MAP header
+length.
+
+b. MAP packet (command specific)
+
+Bit 0 1 2-7 8 - 15 16 - 31
+Function Command Reserved Pad Multiplexer ID Payload length
+Bit 32 - 39 40 - 45 46 - 47 48 - 63
+Function Command name Reserved Command Type Reserved
+Bit 64 - 95
+Function Transaction ID
+Bit 96 - 127
+Function Command data
+
+Command 1 indicates disabling flow while 2 is enabling flow
+
+Command types -
+0 for MAP command request
+1 is to acknowledge the receipt of a command
+2 is for unsupported commands
+3 is for error during processing of commands
+
+c. Aggregation
+
+Aggregation is multiple MAP packets (can be data or command) delivered to
+rmnet in a single linear skb. rmnet will process the individual
+packets and either ACK the MAP command or deliver the IP packet to the
+network stack as needed
+
+MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding....
+MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
+
+3. Userspace configuration
+
+rmnet userspace configuration is done through netlink library librmnetctl
+and command line utility rmnetcli. Utility is hosted in codeaurora forum git.
+The driver uses rtnl_link_ops for communication.
+
+https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index 8c70ba5dee4d..810620153a44 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -818,10 +818,15 @@ The kernel interface functions are as follows:
(*) Send data through a call.
+ typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk,
+ unsigned long user_call_ID,
+ struct sk_buff *skb);
+
int rxrpc_kernel_send_data(struct socket *sock,
struct rxrpc_call *call,
struct msghdr *msg,
- size_t len);
+ size_t len,
+ rxrpc_notify_end_tx_t notify_end_rx);
This is used to supply either the request part of a client call or the
reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the
@@ -832,6 +837,11 @@ The kernel interface functions are as follows:
The msg must not specify a destination address, control data or any flags
other than MSG_MORE. len is the total amount of data to transmit.
+ notify_end_rx can be NULL or it can be used to specify a function to be
+ called when the call changes state to end the Tx phase. This function is
+ called with the call-state spinlock held to prevent any reply or final ACK
+ from being delivered first.
+
(*) Receive data from a call.
int rxrpc_kernel_recv_data(struct socket *sock,
@@ -965,6 +975,51 @@ The kernel interface functions are as follows:
size should be set when the call is begun. tx_total_len may not be less
than zero.
+ (*) Check to see the completion state of a call so that the caller can assess
+ whether it needs to be retried.
+
+ enum rxrpc_call_completion {
+ RXRPC_CALL_SUCCEEDED,
+ RXRPC_CALL_REMOTELY_ABORTED,
+ RXRPC_CALL_LOCALLY_ABORTED,
+ RXRPC_CALL_LOCAL_ERROR,
+ RXRPC_CALL_NETWORK_ERROR,
+ };
+
+ int rxrpc_kernel_check_call(struct socket *sock, struct rxrpc_call *call,
+ enum rxrpc_call_completion *_compl,
+ u32 *_abort_code);
+
+ On return, -EINPROGRESS will be returned if the call is still ongoing; if
+ it is finished, *_compl will be set to indicate the manner of completion,
+ *_abort_code will be set to any abort code that occurred. 0 will be
+ returned on a successful completion, -ECONNABORTED will be returned if the
+ client failed due to a remote abort and anything else will return an
+ appropriate error code.
+
+ The caller should look at this information to decide if it's worth
+ retrying the call.
+
+ (*) Retry a client call.
+
+ int rxrpc_kernel_retry_call(struct socket *sock,
+ struct rxrpc_call *call,
+ struct sockaddr_rxrpc *srx,
+ struct key *key);
+
+ This attempts to partially reinitialise a call and submit it again whilst
+ reusing the original call's Tx queue to avoid the need to repackage and
+ re-encrypt the data to be sent. call indicates the call to retry, srx the
+ new address to send it to and key the encryption key to use for signing or
+ encrypting the packets.
+
+ For this to work, the first Tx data packet must still be in the transmit
+ queue, and currently this is only permitted for local and network errors
+ and the call must not have been aborted. Any partially constructed Tx
+ packet is left as is and can continue being filled afterwards.
+
+ It returns 0 if the call was requeued and an error otherwise.
+
=======================
CONFIGURABLE PARAMETERS
diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.txt
index a0bf573dfa61..13081b3decef 100644
--- a/Documentation/networking/strparser.txt
+++ b/Documentation/networking/strparser.txt
@@ -1,45 +1,107 @@
-Stream Parser
--------------
+Stream Parser (strparser)
+
+Introduction
+============
The stream parser (strparser) is a utility that parses messages of an
-application layer protocol running over a TCP connection. The stream
+application layer protocol running over a data stream. The stream
parser works in conjunction with an upper layer in the kernel to provide
kernel support for application layer messages. For instance, Kernel
Connection Multiplexor (KCM) uses the Stream Parser to parse messages
using a BPF program.
+The strparser works in one of two modes: receive callback or general
+mode.
+
+In receive callback mode, the strparser is called from the data_ready
+callback of a TCP socket. Messages are parsed and delivered as they are
+received on the socket.
+
+In general mode, a sequence of skbs are fed to strparser from an
+outside source. Message are parsed and delivered as the sequence is
+processed. This modes allows strparser to be applied to arbitrary
+streams of data.
+
Interface
----------
+=========
The API includes a context structure, a set of callbacks, utility
-functions, and a data_ready function. The callbacks include
-a parse_msg function that is called to perform parsing (e.g.
-BPF parsing in case of KCM), and a rcv_msg function that is called
-when a full message has been completed.
+functions, and a data_ready function for receive callback mode. The
+callbacks include a parse_msg function that is called to perform
+parsing (e.g. BPF parsing in case of KCM), and a rcv_msg function
+that is called when a full message has been completed.
+
+Functions
+=========
+
+strp_init(struct strparser *strp, struct sock *sk,
+ const struct strp_callbacks *cb)
+
+ Called to initialize a stream parser. strp is a struct of type
+ strparser that is allocated by the upper layer. sk is the TCP
+ socket associated with the stream parser for use with receive
+ callback mode; in general mode this is set to NULL. Callbacks
+ are called by the stream parser (the callbacks are listed below).
+
+void strp_pause(struct strparser *strp)
+
+ Temporarily pause a stream parser. Message parsing is suspended
+ and no new messages are delivered to the upper layer.
+
+void strp_pause(struct strparser *strp)
+
+ Unpause a paused stream parser.
+
+void strp_stop(struct strparser *strp);
+
+ strp_stop is called to completely stop stream parser operations.
+ This is called internally when the stream parser encounters an
+ error, and it is called from the upper layer to stop parsing
+ operations.
+
+void strp_done(struct strparser *strp);
+
+ strp_done is called to release any resources held by the stream
+ parser instance. This must be called after the stream processor
+ has been stopped.
-A stream parser can be instantiated for a TCP connection. This is done
-by:
+int strp_process(struct strparser *strp, struct sk_buff *orig_skb,
+ unsigned int orig_offset, size_t orig_len,
+ size_t max_msg_size, long timeo)
-strp_init(struct strparser *strp, struct sock *csk,
- struct strp_callbacks *cb)
+ strp_process is called in general mode for a stream parser to
+ parse an sk_buff. The number of bytes processed or a negative
+ error number is returned. Note that strp_process does not
+ consume the sk_buff. max_msg_size is maximum size the stream
+ parser will parse. timeo is timeout for completing a message.
-strp is a struct of type strparser that is allocated by the upper layer.
-csk is the TCP socket associated with the stream parser. Callbacks are
-called by the stream parser.
+void strp_data_ready(struct strparser *strp);
+
+ The upper layer calls strp_tcp_data_ready when data is ready on
+ the lower socket for strparser to process. This should be called
+ from a data_ready callback that is set on the socket. Note that
+ maximum messages size is the limit of the receive socket
+ buffer and message timeout is the receive timeout for the socket.
+
+void strp_check_rcv(struct strparser *strp);
+
+ strp_check_rcv is called to check for new messages on the socket.
+ This is normally called at initialization of a stream parser
+ instance or after strp_unpause.
Callbacks
----------
+=========
-There are four callbacks:
+There are six callbacks:
int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
parse_msg is called to determine the length of the next message
in the stream. The upper layer must implement this function. It
should parse the sk_buff as containing the headers for the
- next application layer messages in the stream.
+ next application layer message in the stream.
- The skb->cb in the input skb is a struct strp_rx_msg. Only
+ The skb->cb in the input skb is a struct strp_msg. Only
the offset field is relevant in parse_msg and gives the offset
where the message starts in the skb.
@@ -50,26 +112,41 @@ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb);
-ESTRPIPE : current message should not be processed by the
kernel, return control of the socket to userspace which
can proceed to read the messages itself
- other < 0 : Error is parsing, give control back to userspace
+ other < 0 : Error in parsing, give control back to userspace
assuming that synchronization is lost and the stream
is unrecoverable (application expected to close TCP socket)
In the case that an error is returned (return value is less than
- zero) the stream parser will set the error on TCP socket and wake
- it up. If parse_msg returned -ESTRPIPE and the stream parser had
- previously read some bytes for the current message, then the error
- set on the attached socket is ENODATA since the stream is
- unrecoverable in that case.
+ zero) and the parser is in receive callback mode, then it will set
+ the error on TCP socket and wake it up. If parse_msg returned
+ -ESTRPIPE and the stream parser had previously read some bytes for
+ the current message, then the error set on the attached socket is
+ ENODATA since the stream is unrecoverable in that case.
+
+void (*lock)(struct strparser *strp)
+
+ The lock callback is called to lock the strp structure when
+ the strparser is performing an asynchronous operation (such as
+ processing a timeout). In receive callback mode the default
+ function is to lock_sock for the associated socket. In general
+ mode the callback must be set appropriately.
+
+void (*unlock)(struct strparser *strp)
+
+ The unlock callback is called to release the lock obtained
+ by the lock callback. In receive callback mode the default
+ function is release_sock for the associated socket. In general
+ mode the callback must be set appropriately.
void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
rcv_msg is called when a full message has been received and
is queued. The callee must consume the sk_buff; it can
call strp_pause to prevent any further messages from being
- received in rcv_msg (see strp_pause below). This callback
+ received in rcv_msg (see strp_pause above). This callback
must be set.
- The skb->cb in the input skb is a struct strp_rx_msg. This
+ The skb->cb in the input skb is a struct strp_msg. This
struct contains two fields: offset and full_len. Offset is
where the message starts in the skb, and full_len is the
the length of the message. skb->len - offset may be greater
@@ -78,59 +155,53 @@ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb);
int (*read_sock_done)(struct strparser *strp, int err);
read_sock_done is called when the stream parser is done reading
- the TCP socket. The stream parser may read multiple messages
- in a loop and this function allows cleanup to occur when existing
- the loop. If the callback is not set (NULL in strp_init) a
- default function is used.
+ the TCP socket in receive callback mode. The stream parser may
+ read multiple messages in a loop and this function allows cleanup
+ to occur when exiting the loop. If the callback is not set (NULL
+ in strp_init) a default function is used.
void (*abort_parser)(struct strparser *strp, int err);
This function is called when stream parser encounters an error
- in parsing. The default function stops the stream parser for the
- TCP socket and sets the error in the socket. The default function
- can be changed by setting the callback to non-NULL in strp_init.
+ in parsing. The default function stops the stream parser and
+ sets the error in the socket if the parser is in receive callback
+ mode. The default function can be changed by setting the callback
+ to non-NULL in strp_init.
-Functions
----------
+Statistics
+==========
-The upper layer calls strp_tcp_data_ready when data is ready on the lower
-socket for strparser to process. This should be called from a data_ready
-callback that is set on the socket.
+Various counters are kept for each stream parser instance. These are in
+the strp_stats structure. strp_aggr_stats is a convenience structure for
+accumulating statistics for multiple stream parser instances.
+save_strp_stats and aggregate_strp_stats are helper functions to save
+and aggregate statistics.
-strp_stop is called to completely stop stream parser operations. This
-is called internally when the stream parser encounters an error, and
-it is called from the upper layer when unattaching a TCP socket.
+Message assembly limits
+=======================
-strp_done is called to unattach the stream parser from the TCP socket.
-This must be called after the stream processor has be stopped.
+The stream parser provide mechanisms to limit the resources consumed by
+message assembly.
-strp_check_rcv is called to check for new messages on the socket. This
-is normally called at initialization of the a stream parser instance
-of after strp_unpause.
+A timer is set when assembly starts for a new message. In receive
+callback mode the message timeout is taken from rcvtime for the
+associated TCP socket. In general mode, the timeout is passed as an
+argument in strp_process. If the timer fires before assembly completes
+the stream parser is aborted and the ETIMEDOUT error is set on the TCP
+socket if in receive callback mode.
-Statistics
-----------
+In receive callback mode, message length is limited to the receive
+buffer size of the associated TCP socket. If the length returned by
+parse_msg is greater than the socket buffer size then the stream parser
+is aborted with EMSGSIZE error set on the TCP socket. Note that this
+makes the maximum size of receive skbuffs for a socket with a stream
+parser to be 2*sk_rcvbuf of the TCP socket.
-Various counters are kept for each stream parser for a TCP socket.
-These are in the strp_stats structure. strp_aggr_stats is a convenience
-structure for accumulating statistics for multiple stream parser
-instances. save_strp_stats and aggregate_strp_stats are helper functions
-to save and aggregate statistics.
+In general mode the message length limit is passed in as an argument
+to strp_process.
-Message assembly limits
------------------------
+Author
+======
-The stream parser provide mechanisms to limit the resources consumed by
-message assembly.
+Tom Herbert (tom@quantonium.net)
-A timer is set when assembly starts for a new message. The message
-timeout is taken from rcvtime for the associated TCP socket. If the
-timer fires before assembly completes the stream parser is aborted
-and the ETIMEDOUT error is set on the TCP socket.
-
-Message length is limited to the receive buffer size of the associated
-TCP socket. If the length returned by parse_msg is greater than
-the socket buffer size then the stream parser is aborted with
-EMSGSIZE error set on the TCP socket. Note that this makes the
-maximum size of receive skbuffs for a socket with a stream parser
-to be 2*sk_rcvbuf of the TCP socket.
diff --git a/Documentation/nvmem/nvmem.txt b/Documentation/nvmem/nvmem.txt
index dbd40d879239..8d8d8f58f96f 100644
--- a/Documentation/nvmem/nvmem.txt
+++ b/Documentation/nvmem/nvmem.txt
@@ -112,7 +112,7 @@ take nvmem_device as parameter.
5. Releasing a reference to the NVMEM
=====================================
-When a consumers no longer needs the NVMEM, it has to release the reference
+When a consumer no longer needs the NVMEM, it has to release the reference
to the NVMEM it has obtained using the APIs mentioned in the above section.
The NVMEM framework provides 2 APIs to release a reference to the NVMEM.
diff --git a/Documentation/power/states.txt b/Documentation/power/states.txt
deleted file mode 100644
index bc4548245a24..000000000000
--- a/Documentation/power/states.txt
+++ /dev/null
@@ -1,125 +0,0 @@
-System Power Management Sleep States
-
-(C) 2014 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
-The kernel supports up to four system sleep states generically, although three
-of them depend on the platform support code to implement the low-level details
-for each state.
-
-The states are represented by strings that can be read or written to the
-/sys/power/state file. Those strings may be "mem", "standby", "freeze" and
-"disk", where the last three always represent Power-On Suspend (if supported),
-Suspend-To-Idle and hibernation (Suspend-To-Disk), respectively.
-
-The meaning of the "mem" string is controlled by the /sys/power/mem_sleep file.
-It contains strings representing the available modes of system suspend that may
-be triggered by writing "mem" to /sys/power/state. These modes are "s2idle"
-(Suspend-To-Idle), "shallow" (Power-On Suspend) and "deep" (Suspend-To-RAM).
-The "s2idle" mode is always available, while the other ones are only available
-if supported by the platform (if not supported, the strings representing them
-are not present in /sys/power/mem_sleep). The string representing the suspend
-mode to be used subsequently is enclosed in square brackets. Writing one of
-the other strings present in /sys/power/mem_sleep to it causes the suspend mode
-to be used subsequently to change to the one represented by that string.
-
-Consequently, there are two ways to cause the system to go into the
-Suspend-To-Idle sleep state. The first one is to write "freeze" directly to
-/sys/power/state. The second one is to write "s2idle" to /sys/power/mem_sleep
-and then to write "mem" to /sys/power/state. Similarly, there are two ways
-to cause the system to go into the Power-On Suspend sleep state (the strings to
-write to the control files in that case are "standby" or "shallow" and "mem",
-respectively) if that state is supported by the platform. In turn, there is
-only one way to cause the system to go into the Suspend-To-RAM state (write
-"deep" into /sys/power/mem_sleep and "mem" into /sys/power/state).
-
-The default suspend mode (ie. the one to be used without writing anything into
-/sys/power/mem_sleep) is either "deep" (if Suspend-To-RAM is supported) or
-"s2idle", but it can be overridden by the value of the "mem_sleep_default"
-parameter in the kernel command line.
-
-The properties of all of the sleep states are described below.
-
-
-State: Suspend-To-Idle
-ACPI state: S0
-Label: "s2idle" ("freeze")
-
-This state is a generic, pure software, light-weight, system sleep state.
-It allows more energy to be saved relative to runtime idle by freezing user
-space and putting all I/O devices into low-power states (possibly
-lower-power than available at run time), such that the processors can
-spend more time in their idle states.
-
-This state can be used for platforms without Power-On Suspend/Suspend-to-RAM
-support, or it can be used in addition to Suspend-to-RAM to provide reduced
-resume latency. It is always supported.
-
-
-State: Standby / Power-On Suspend
-ACPI State: S1
-Label: "shallow" ("standby")
-
-This state, if supported, offers moderate, though real, power savings, while
-providing a relatively low-latency transition back to a working system. No
-operating state is lost (the CPU retains power), so the system easily starts up
-again where it left off.
-
-In addition to freezing user space and putting all I/O devices into low-power
-states, which is done for Suspend-To-Idle too, nonboot CPUs are taken offline
-and all low-level system functions are suspended during transitions into this
-state. For this reason, it should allow more energy to be saved relative to
-Suspend-To-Idle, but the resume latency will generally be greater than for that
-state.
-
-
-State: Suspend-to-RAM
-ACPI State: S3
-Label: "deep"
-
-This state, if supported, offers significant power savings as everything in the
-system is put into a low-power state, except for memory, which should be placed
-into the self-refresh mode to retain its contents. All of the steps carried out
-when entering Power-On Suspend are also carried out during transitions to STR.
-Additional operations may take place depending on the platform capabilities. In
-particular, on ACPI systems the kernel passes control to the BIOS (platform
-firmware) as the last step during STR transitions and that usually results in
-powering down some more low-level components that aren't directly controlled by
-the kernel.
-
-System and device state is saved and kept in memory. All devices are suspended
-and put into low-power states. In many cases, all peripheral buses lose power
-when entering STR, so devices must be able to handle the transition back to the
-"on" state.
-
-For at least ACPI, STR requires some minimal boot-strapping code to resume the
-system from it. This may be the case on other platforms too.
-
-
-State: Suspend-to-disk
-ACPI State: S4
-Label: "disk"
-
-This state offers the greatest power savings, and can be used even in
-the absence of low-level platform support for power management. This
-state operates similarly to Suspend-to-RAM, but includes a final step
-of writing memory contents to disk. On resume, this is read and memory
-is restored to its pre-suspend state.
-
-STD can be handled by the firmware or the kernel. If it is handled by
-the firmware, it usually requires a dedicated partition that must be
-setup via another operating system for it to use. Despite the
-inconvenience, this method requires minimal work by the kernel, since
-the firmware will also handle restoring memory contents on resume.
-
-For suspend-to-disk, a mechanism called 'swsusp' (Swap Suspend) is used
-to write memory contents to free swap space. swsusp has some restrictive
-requirements, but should work in most cases. Some, albeit outdated,
-documentation can be found in Documentation/power/swsusp.txt.
-Alternatively, userspace can do most of the actual suspend to disk work,
-see userland-swsusp.txt.
-
-Once memory state is written to disk, the system may either enter a
-low-power state (like ACPI S4), or it may simply power down. Powering
-down offers greater savings, and allows this mechanism to work on any
-system. However, entering a real low-power state allows the user to
-trigger wake up events (e.g. pressing a key or opening a laptop lid).
diff --git a/Documentation/printk-formats.txt b/Documentation/printk-formats.txt
index 074670b98bac..361789df51ec 100644
--- a/Documentation/printk-formats.txt
+++ b/Documentation/printk-formats.txt
@@ -75,6 +75,16 @@ used when printing stack backtraces. The specifier takes into
consideration the effect of compiler optimisations which may occur
when tail-call``s are used and marked with the noreturn GCC attribute.
+Examples::
+
+ printk("Going to call: %pF\n", gettimeofday);
+ printk("Going to call: %pF\n", p->func);
+ printk("%s: called from %pS\n", __func__, (void *)_RET_IP_);
+ printk("%s: called from %pS\n", __func__,
+ (void *)__builtin_return_address(0));
+ printk("Faulted at %pS\n", (void *)regs->ip);
+ printk(" %s%pB\n", (reliable ? "" : "? "), (void *)*stack);
+
Kernel Pointers
===============
diff --git a/Documentation/process/applying-patches.rst b/Documentation/process/applying-patches.rst
index a0d058cc6d25..dc2ddc345044 100644
--- a/Documentation/process/applying-patches.rst
+++ b/Documentation/process/applying-patches.rst
@@ -6,9 +6,6 @@ Applying Patches To The Linux Kernel
Original by:
Jesper Juhl, August 2005
-Last update:
- 2016-09-14
-
.. note::
This document is obsolete. In most cases, rather than using ``patch``
@@ -344,7 +341,7 @@ possible.
This is a good branch to run for people who want to help out testing
development kernels but do not want to run some of the really experimental
-stuff (such people should see the sections about -git and -mm kernels below).
+stuff (such people should see the sections about -next and -mm kernels below).
The -rc patches are not incremental, they apply to a base 4.x kernel, just
like the 4.x.y patches described above. The kernel version before the -rcN
@@ -380,44 +377,6 @@ Here are 3 examples of how to apply these patches::
$ mv linux-4.7.3 linux-4.8-rc5 # rename the kernel source dir
-The -git kernels
-================
-
-These are daily snapshots of Linus' kernel tree (managed in a git
-repository, hence the name).
-
-These patches are usually released daily and represent the current state of
-Linus's tree. They are more experimental than -rc kernels since they are
-generated automatically without even a cursory glance to see if they are
-sane.
-
--git patches are not incremental and apply either to a base 4.x kernel or
-a base 4.x-rc kernel -- you can see which from their name.
-A patch named 4.7-git1 applies to the 4.7 kernel source and a patch
-named 4.8-rc3-git2 applies to the source of the 4.8-rc3 kernel.
-
-Here are some examples of how to apply these patches::
-
- # moving from 4.7 to 4.7-git1
-
- $ cd ~/linux-4.7 # change to the kernel source dir
- $ patch -p1 < ../patch-4.7-git1 # apply the 4.7-git1 patch
- $ cd ..
- $ mv linux-4.7 linux-4.7-git1 # rename the kernel source dir
-
- # moving from 4.7-git1 to 4.8-rc2-git3
-
- $ cd ~/linux-4.7-git1 # change to the kernel source dir
- $ patch -p1 -R < ../patch-4.7-git1 # revert the 4.7-git1 patch
- # we now have a 4.7 kernel
- $ patch -p1 < ../patch-4.8-rc2 # apply the 4.8-rc2 patch
- # the kernel is now 4.8-rc2
- $ patch -p1 < ../patch-4.8-rc2-git3 # apply the 4.8-rc2-git3 patch
- # the kernel is now 4.8-rc2-git3
- $ cd ..
- $ mv linux-4.7-git1 linux-4.8-rc2-git3 # rename source dir
-
-
The -mm patches and the linux-next tree
=======================================
diff --git a/Documentation/process/changes.rst b/Documentation/process/changes.rst
index adbb50ae5246..560beaef5a7c 100644
--- a/Documentation/process/changes.rst
+++ b/Documentation/process/changes.rst
@@ -53,7 +53,7 @@ mcelog 0.6 mcelog --version
iptables 1.4.2 iptables -V
openssl & libcrypto 1.0.0 openssl version
bc 1.06.95 bc --version
-Sphinx\ [#f1]_ 1.2 sphinx-build --version
+Sphinx\ [#f1]_ 1.3 sphinx-build --version
====================== =============== ========================================
.. [#f1] Sphinx is needed only to build the Kernel documentation
@@ -309,18 +309,8 @@ Kernel documentation
Sphinx
------
-The ReST markups currently used by the Documentation/ files are meant to be
-built with ``Sphinx`` version 1.2 or upper. If you're desiring to build
-PDF outputs, it is recommended to use version 1.4.6.
-
-.. note::
-
- Please notice that, for PDF and LaTeX output, you'll also need ``XeLaTeX``
- version 3.14159265. Depending on the distribution, you may also need to
- install a series of ``texlive`` packages that provide the minimal set of
- functionalities required for ``XeLaTex`` to work. For PDF output you'll also
- need ``convert(1)`` from ImageMagick (https://www.imagemagick.org).
-
+Please see :ref:`sphinx_install` in ``Documentation/doc-guide/sphinx.rst``
+for details about Sphinx requirements.
Getting updated software
========================
diff --git a/Documentation/process/stable-kernel-rules.rst b/Documentation/process/stable-kernel-rules.rst
index 61e9c78bd6d1..36a2dded525b 100644
--- a/Documentation/process/stable-kernel-rules.rst
+++ b/Documentation/process/stable-kernel-rules.rst
@@ -166,12 +166,12 @@ Trees
- The queues of patches, for both completed versions and in progress
versions can be found at:
- http://git.kernel.org/?p=linux/kernel/git/stable/stable-queue.git
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/stable-queue.git
- The finalized and tagged releases of all stable kernels can be found
in separate branches per version at:
- http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git
Review committee
diff --git a/Documentation/process/submitting-patches.rst b/Documentation/process/submitting-patches.rst
index 3e10719fee35..733478ade91b 100644
--- a/Documentation/process/submitting-patches.rst
+++ b/Documentation/process/submitting-patches.rst
@@ -413,7 +413,7 @@ e-mail discussions.
-11) Sign your work — the Developer's Certificate of Origin
+11) Sign your work - the Developer's Certificate of Origin
----------------------------------------------------------
To improve tracking of who did what, especially with patches that can
diff --git a/Documentation/security/keys/core.rst b/Documentation/security/keys/core.rst
index 1648fa80b3bf..1266eeae45f6 100644
--- a/Documentation/security/keys/core.rst
+++ b/Documentation/security/keys/core.rst
@@ -16,17 +16,7 @@ The key service can be configured on by enabling:
This document has the following sections:
- - Key overview
- - Key service overview
- - Key access permissions
- - SELinux support
- - New procfs files
- - Userspace system call interface
- - Kernel services
- - Notes on accessing payload contents
- - Defining a key type
- - Request-key callback service
- - Garbage collection
+.. contents:: :local:
Key Overview
@@ -443,7 +433,7 @@ The main syscalls are:
/sbin/request-key will be invoked in an attempt to obtain a key. The
callout_info string will be passed as an argument to the program.
- See also Documentation/security/keys-request-key.txt.
+ See also Documentation/security/keys/request-key.rst.
The keyctl syscall functions are:
@@ -973,7 +963,7 @@ payload contents" for more information.
If successful, the key will have been attached to the default keyring for
implicitly obtained request-key keys, as set by KEYCTL_SET_REQKEY_KEYRING.
- See also Documentation/security/keys-request-key.txt.
+ See also Documentation/security/keys/request-key.rst.
* To search for a key, passing auxiliary data to the upcaller, call::
diff --git a/Documentation/security/keys/request-key.rst b/Documentation/security/keys/request-key.rst
index aba32784174c..b2d16abaa9e9 100644
--- a/Documentation/security/keys/request-key.rst
+++ b/Documentation/security/keys/request-key.rst
@@ -3,7 +3,7 @@ Key Request Service
===================
The key request service is part of the key retention service (refer to
-Documentation/security/keys.txt). This document explains more fully how
+Documentation/security/core.rst). This document explains more fully how
the requesting algorithm works.
The process starts by either the kernel requesting a service by calling
diff --git a/Documentation/security/keys/trusted-encrypted.rst b/Documentation/security/keys/trusted-encrypted.rst
index 7b503831bdea..3bb24e09a332 100644
--- a/Documentation/security/keys/trusted-encrypted.rst
+++ b/Documentation/security/keys/trusted-encrypted.rst
@@ -172,4 +172,4 @@ Other uses for trusted and encrypted keys, such as for disk and file encryption
are anticipated. In particular the new format 'ecryptfs' has been defined in
in order to use encrypted keys to mount an eCryptfs filesystem. More details
about the usage can be found in the file
-``Documentation/security/keys-ecryptfs.txt``.
+``Documentation/security/keys/ecryptfs.rst``.
diff --git a/Documentation/sphinx-static/theme_overrides.css b/Documentation/sphinx-static/theme_overrides.css
index d5764a4de5a2..522b6d4c49d4 100644
--- a/Documentation/sphinx-static/theme_overrides.css
+++ b/Documentation/sphinx-static/theme_overrides.css
@@ -4,6 +4,17 @@
*
*/
+/* Interim: Code-blocks with line nos - lines and line numbers don't line up.
+ * see: https://github.com/rtfd/sphinx_rtd_theme/issues/419
+ */
+
+div[class^="highlight"] pre {
+ line-height: normal;
+}
+.rst-content .highlight > pre {
+ line-height: normal;
+}
+
@media screen {
/* content column
@@ -56,6 +67,12 @@
font-family: "Courier New", Courier, monospace
}
+ /* fix bottom margin of lists items */
+
+ .rst-content .section ul li:last-child, .rst-content .section ul li p:last-child {
+ margin-bottom: 12px;
+ }
+
/* inline literal: drop the borderbox, padding and red color */
code, .rst-content tt, .rst-content code {
diff --git a/Documentation/sphinx/kerneldoc.py b/Documentation/sphinx/kerneldoc.py
index d15e07f36881..39aa9e8697cc 100644
--- a/Documentation/sphinx/kerneldoc.py
+++ b/Documentation/sphinx/kerneldoc.py
@@ -27,6 +27,7 @@
# Please make sure this works on both python2 and python3.
#
+import codecs
import os
import subprocess
import sys
@@ -88,13 +89,10 @@ class KernelDocDirective(Directive):
try:
env.app.verbose('calling kernel-doc \'%s\'' % (" ".join(cmd)))
- p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
+ p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
- # python2 needs conversion to unicode.
- # python3 with universal_newlines=True returns strings.
- if sys.version_info.major < 3:
- out, err = unicode(out, 'utf-8'), unicode(err, 'utf-8')
+ out, err = codecs.decode(out, 'utf-8'), codecs.decode(err, 'utf-8')
if p.returncode != 0:
sys.stderr.write(err)
diff --git a/Documentation/sphinx/requirements.txt b/Documentation/sphinx/requirements.txt
new file mode 100644
index 000000000000..742be3e12619
--- /dev/null
+++ b/Documentation/sphinx/requirements.txt
@@ -0,0 +1,3 @@
+docutils==0.12
+Sphinx==1.4.9
+sphinx_rtd_theme
diff --git a/Documentation/static-keys.txt b/Documentation/static-keys.txt
index b83dfa1c0602..ab16efe0c79d 100644
--- a/Documentation/static-keys.txt
+++ b/Documentation/static-keys.txt
@@ -149,6 +149,26 @@ static_branch_inc(), will change the branch back to true. Likewise, if the
key is initialized false, a 'static_branch_inc()', will change the branch to
true. And then a 'static_branch_dec()', will again make the branch false.
+The state and the reference count can be retrieved with 'static_key_enabled()'
+and 'static_key_count()'. In general, if you use these functions, they
+should be protected with the same mutex used around the enable/disable
+or increment/decrement function.
+
+Note that switching branches results in some locks being taken,
+particularly the CPU hotplug lock (in order to avoid races against
+CPUs being brought in the kernel whilst the kernel is getting
+patched). Calling the static key API from within a hotplug notifier is
+thus a sure deadlock recipe. In order to still allow use of the
+functionnality, the following functions are provided:
+
+ static_key_enable_cpuslocked()
+ static_key_disable_cpuslocked()
+ static_branch_enable_cpuslocked()
+ static_branch_disable_cpuslocked()
+
+These functions are *not* general purpose, and must only be used when
+you really know that you're in the above context, and no other.
+
Where an array of keys is required, it can be defined as::
DEFINE_STATIC_KEY_ARRAY_TRUE(keys, count);
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index bac23c198360..ce61d1fe08ca 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -61,6 +61,7 @@ show up in /proc/sys/kernel:
- perf_cpu_time_max_percent
- perf_event_paranoid
- perf_event_max_stack
+- perf_event_mlock_kb
- perf_event_max_contexts_per_stack
- pid_max
- powersave-nap [ PPC only ]
@@ -654,7 +655,9 @@ Controls use of the performance events system by unprivileged
users (without CAP_SYS_ADMIN). The default value is 2.
-1: Allow use of (almost) all events by all users
->=0: Disallow raw tracepoint access by users without CAP_IOC_LOCK
+ Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
+>=0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
+ Disallow raw tracepoint access by users without CAP_SYS_ADMIN
>=1: Disallow CPU event access by users without CAP_SYS_ADMIN
>=2: Disallow kernel profiling by users without CAP_SYS_ADMIN
@@ -673,6 +676,14 @@ The default value is 127.
==============================================================
+perf_event_mlock_kb:
+
+Control size of per-cpu ring buffer not counted agains mlock limit.
+
+The default value is 512 + 1 page
+
+==============================================================
+
perf_event_max_contexts_per_stack:
Controls maximum number of stack frame context entries for
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
index 28596e03220b..b67044a2575f 100644
--- a/Documentation/sysctl/net.txt
+++ b/Documentation/sysctl/net.txt
@@ -46,13 +46,13 @@ translate these BPF proglets into native CPU instructions. There are
two flavors of JITs, the newer eBPF JIT currently supported on:
- x86_64
- arm64
+ - arm32
- ppc64
- sparc64
- mips64
- s390x
And the older cBPF JIT supported on the following archs:
- - arm
- mips
- ppc
- sparc
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 48244c42ff52..9baf66a9ef4e 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -572,7 +572,9 @@ See Documentation/nommu-mmap.txt for more information.
numa_zonelist_order
-This sysctl is only for NUMA.
+This sysctl is only for NUMA and it is deprecated. Anything but
+Node order will fail!
+
'where the memory is allocated from' is controlled by zonelists.
(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
you may be able to read ZONE_DMA as ZONE_DMA32...)
diff --git a/Documentation/trace/stm.txt b/Documentation/trace/stm.txt
index 11cff47eecce..03765750104b 100644
--- a/Documentation/trace/stm.txt
+++ b/Documentation/trace/stm.txt
@@ -83,7 +83,7 @@ by writing the name of the desired stm device there, for example:
$ echo dummy_stm.0 > /sys/class/stm_source/console/stm_source_link
For examples on how to use stm_source interface in the kernel, refer
-to stm_console or stm_heartbeat drivers.
+to stm_console, stm_heartbeat or stm_ftrace drivers.
Each stm_source device will need to assume a master and a range of
channels, depending on how many channels it requires. These are
@@ -107,5 +107,16 @@ console in the STP stream, create a "console" policy entry (see the
beginning of this text on how to do that). When initialized, it will
consume one channel.
+stm_ftrace
+==========
+
+This is another "stm_source" device, once the stm_ftrace has been
+linked with an stm device, and if "function" tracer is enabled,
+function address and parent function address which Ftrace subsystem
+would store into ring buffer will be exported via the stm device at
+the same time.
+
+Currently only Ftrace "function" tracer is supported.
+
[1] https://software.intel.com/sites/default/files/managed/d3/3c/intel-th-developer-manual.pdf
[2] http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0444b/index.html
diff --git a/Documentation/translations/ko_KR/memory-barriers.txt b/Documentation/translations/ko_KR/memory-barriers.txt
index 38310dcd6620..bc80fc0e210f 100644
--- a/Documentation/translations/ko_KR/memory-barriers.txt
+++ b/Documentation/translations/ko_KR/memory-barriers.txt
@@ -1956,10 +1956,7 @@ MMIO 쓰기 배리어
뒤에 완료됩니다.
ACQUIRE 앞에서 요청된 메모리 오퍼레이션은 ACQUIRE 오퍼레이션이 완료된 후에
- 완료될 수 있습니다. smp_mb__before_spinlock() 뒤에 ACQUIRE 가 실행되는
- 코드 블록은 블록 앞의 스토어를 블록 뒤의 로드와 스토어에 대해 순서
- 맞춥니다. 이건 smp_mb() 보다 완화된 것임을 기억하세요! 많은 아키텍쳐에서
- smp_mb__before_spinlock() 은 사실 아무일도 하지 않습니다.
+ 완료될 수 있습니다.
(2) RELEASE 오퍼레이션의 영향:
diff --git a/Documentation/translations/zh_CN/HOWTO b/Documentation/translations/zh_CN/HOWTO
index 11be075ba5fa..5f6d09edc9ac 100644
--- a/Documentation/translations/zh_CN/HOWTO
+++ b/Documentation/translations/zh_CN/HOWTO
@@ -149,9 +149,7 @@ Linux内核代码中包含有大量的文档。这些文档对于学习如何与
核源码的主目录中使用以下不同命令将会分别生成PDF、Postscript、HTML和手册
页等不同格式的文档:
make pdfdocs
- make psdocs
make htmldocs
- make mandocs
如何成为内核开发者
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
index a08f71647714..a31b85b9bb88 100644
--- a/Documentation/vm/numa
+++ b/Documentation/vm/numa
@@ -79,11 +79,8 @@ memory, Linux must decide whether to order the zonelists such that allocations
fall back to the same zone type on a different node, or to a different zone
type on the same node. This is an important consideration because some zones,
such as DMA or DMA32, represent relatively scarce resources. Linux chooses
-a default zonelist order based on the sizes of the various zone types relative
-to the total memory of the node and the total memory of the system. The
-default zonelist order may be overridden using the numa_zonelist_order kernel
-boot parameter or sysctl. [see Documentation/admin-guide/kernel-parameters.rst and
-Documentation/sysctl/vm.txt]
+a default Node ordered zonelist. This means it tries to fallback to other zones
+from the same node before using remote nodes which are ordered by NUMA distance.
By default, Linux will attempt to satisfy memory allocation requests from the
node to which the CPU that executes the request is assigned. Specifically,
diff --git a/Documentation/vm/swap_numa.txt b/Documentation/vm/swap_numa.txt
new file mode 100644
index 000000000000..d5960c9124f5
--- /dev/null
+++ b/Documentation/vm/swap_numa.txt
@@ -0,0 +1,69 @@
+Automatically bind swap device to numa node
+-------------------------------------------
+
+If the system has more than one swap device and swap device has the node
+information, we can make use of this information to decide which swap
+device to use in get_swap_pages() to get better performance.
+
+
+How to use this feature
+-----------------------
+
+Swap device has priority and that decides the order of it to be used. To make
+use of automatically binding, there is no need to manipulate priority settings
+for swap devices. e.g. on a 2 node machine, assume 2 swap devices swapA and
+swapB, with swapA attached to node 0 and swapB attached to node 1, are going
+to be swapped on. Simply swapping them on by doing:
+# swapon /dev/swapA
+# swapon /dev/swapB
+
+Then node 0 will use the two swap devices in the order of swapA then swapB and
+node 1 will use the two swap devices in the order of swapB then swapA. Note
+that the order of them being swapped on doesn't matter.
+
+A more complex example on a 4 node machine. Assume 6 swap devices are going to
+be swapped on: swapA and swapB are attached to node 0, swapC is attached to
+node 1, swapD and swapE are attached to node 2 and swapF is attached to node3.
+The way to swap them on is the same as above:
+# swapon /dev/swapA
+# swapon /dev/swapB
+# swapon /dev/swapC
+# swapon /dev/swapD
+# swapon /dev/swapE
+# swapon /dev/swapF
+
+Then node 0 will use them in the order of:
+swapA/swapB -> swapC -> swapD -> swapE -> swapF
+swapA and swapB will be used in a round robin mode before any other swap device.
+
+node 1 will use them in the order of:
+swapC -> swapA -> swapB -> swapD -> swapE -> swapF
+
+node 2 will use them in the order of:
+swapD/swapE -> swapA -> swapB -> swapC -> swapF
+Similaly, swapD and swapE will be used in a round robin mode before any
+other swap devices.
+
+node 3 will use them in the order of:
+swapF -> swapA -> swapB -> swapC -> swapD -> swapE
+
+
+Implementation details
+----------------------
+
+The current code uses a priority based list, swap_avail_list, to decide
+which swap device to use and if multiple swap devices share the same
+priority, they are used round robin. This change here replaces the single
+global swap_avail_list with a per-numa-node list, i.e. for each numa node,
+it sees its own priority based list of available swap devices. Swap
+device's priority can be promoted on its matching node's swap_avail_list.
+
+The current swap device's priority is set as: user can set a >=0 value,
+or the system will pick one starting from -1 then downwards. The priority
+value in the swap_avail_list is the negated value of the swap device's
+due to plist being sorted from low to high. The new policy doesn't change
+the semantics for priority >=0 cases, the previous starting from -1 then
+downwards now becomes starting from -2 then downwards and -1 is reserved
+as the promoted value. So if multiple swap devices are attached to the same
+node, they will all be promoted to priority -1 on that node's plist and will
+be used round robin before any other swap devices.
diff --git a/Documentation/x86/early-microcode.txt b/Documentation/x86/early-microcode.txt
deleted file mode 100644
index 07749e7f3d50..000000000000
--- a/Documentation/x86/early-microcode.txt
+++ /dev/null
@@ -1,70 +0,0 @@
-Early load microcode
-====================
-By Fenghua Yu <fenghua.yu@intel.com>
-
-Kernel can update microcode in early phase of boot time. Loading microcode early
-can fix CPU issues before they are observed during kernel boot time.
-
-Microcode is stored in an initrd file. The microcode is read from the initrd
-file and loaded to CPUs during boot time.
-
-The format of the combined initrd image is microcode in cpio format followed by
-the initrd image (maybe compressed). Kernel parses the combined initrd image
-during boot time. The microcode file in cpio name space is:
-on Intel: kernel/x86/microcode/GenuineIntel.bin
-on AMD : kernel/x86/microcode/AuthenticAMD.bin
-
-During BSP boot (before SMP starts), if the kernel finds the microcode file in
-the initrd file, it parses the microcode and saves matching microcode in memory.
-If matching microcode is found, it will be uploaded in BSP and later on in all
-APs.
-
-The cached microcode patch is applied when CPUs resume from a sleep state.
-
-There are two legacy user space interfaces to load microcode, either through
-/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file
-in sysfs.
-
-In addition to these two legacy methods, the early loading method described
-here is the third method with which microcode can be uploaded to a system's
-CPUs.
-
-The following example script shows how to generate a new combined initrd file in
-/boot/initrd-3.5.0.ucode.img with original microcode microcode.bin and
-original initrd image /boot/initrd-3.5.0.img.
-
-mkdir initrd
-cd initrd
-mkdir -p kernel/x86/microcode
-cp ../microcode.bin kernel/x86/microcode/GenuineIntel.bin (or AuthenticAMD.bin)
-find . | cpio -o -H newc >../ucode.cpio
-cd ..
-cat ucode.cpio /boot/initrd-3.5.0.img >/boot/initrd-3.5.0.ucode.img
-
-Builtin microcode
-=================
-
-We can also load builtin microcode supplied through the regular firmware
-builtin method CONFIG_FIRMWARE_IN_KERNEL. Only 64-bit is currently
-supported.
-
-Here's an example:
-
-CONFIG_FIRMWARE_IN_KERNEL=y
-CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin"
-CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
-
-This basically means, you have the following tree structure locally:
-
-/lib/firmware/
-|-- amd-ucode
-...
-| |-- microcode_amd_fam15h.bin
-...
-|-- intel-ucode
-...
-| |-- 06-3a-09
-...
-
-so that the build system can find those files and integrate them into
-the final kernel image. The early loader finds them and applies them.
diff --git a/Documentation/x86/intel_rdt_ui.txt b/Documentation/x86/intel_rdt_ui.txt
index c491a1b82de2..4d8848e4e224 100644
--- a/Documentation/x86/intel_rdt_ui.txt
+++ b/Documentation/x86/intel_rdt_ui.txt
@@ -6,8 +6,8 @@ Fenghua Yu <fenghua.yu@intel.com>
Tony Luck <tony.luck@intel.com>
Vikas Shivappa <vikas.shivappa@intel.com>
-This feature is enabled by the CONFIG_INTEL_RDT_A Kconfig and the
-X86 /proc/cpuinfo flag bits "rdt", "cat_l3" and "cdp_l3".
+This feature is enabled by the CONFIG_INTEL_RDT Kconfig and the
+X86 /proc/cpuinfo flag bits "rdt", "cqm", "cat_l3" and "cdp_l3".
To use the feature mount the file system:
@@ -17,6 +17,13 @@ mount options are:
"cdp": Enable code/data prioritization in L3 cache allocations.
+RDT features are orthogonal. A particular system may support only
+monitoring, only control, or both monitoring and control.
+
+The mount succeeds if either of allocation or monitoring is present, but
+only those files and directories supported by the system will be created.
+For more details on the behavior of the interface during monitoring
+and allocation, see the "Resource alloc and monitor groups" section.
Info directory
--------------
@@ -24,7 +31,12 @@ Info directory
The 'info' directory contains information about the enabled
resources. Each resource has its own subdirectory. The subdirectory
names reflect the resource names.
-Cache resource(L3/L2) subdirectory contains the following files:
+
+Each subdirectory contains the following files with respect to
+allocation:
+
+Cache resource(L3/L2) subdirectory contains the following files
+related to allocation:
"num_closids": The number of CLOSIDs which are valid for this
resource. The kernel uses the smallest number of
@@ -36,7 +48,15 @@ Cache resource(L3/L2) subdirectory contains the following files:
"min_cbm_bits": The minimum number of consecutive bits which
must be set when writing a mask.
-Memory bandwitdh(MB) subdirectory contains the following files:
+"shareable_bits": Bitmask of shareable resource with other executing
+ entities (e.g. I/O). User can use this when
+ setting up exclusive cache partitions. Note that
+ some platforms support devices that have their
+ own settings for cache use which can over-ride
+ these bits.
+
+Memory bandwitdh(MB) subdirectory contains the following files
+with respect to allocation:
"min_bandwidth": The minimum memory bandwidth percentage which
user can request.
@@ -52,48 +72,152 @@ Memory bandwitdh(MB) subdirectory contains the following files:
non-linear. This field is purely informational
only.
-Resource groups
----------------
+If RDT monitoring is available there will be an "L3_MON" directory
+with the following files:
+
+"num_rmids": The number of RMIDs available. This is the
+ upper bound for how many "CTRL_MON" + "MON"
+ groups can be created.
+
+"mon_features": Lists the monitoring events if
+ monitoring is enabled for the resource.
+
+"max_threshold_occupancy":
+ Read/write file provides the largest value (in
+ bytes) at which a previously used LLC_occupancy
+ counter can be considered for re-use.
+
+
+Resource alloc and monitor groups
+---------------------------------
+
Resource groups are represented as directories in the resctrl file
-system. The default group is the root directory. Other groups may be
-created as desired by the system administrator using the "mkdir(1)"
-command, and removed using "rmdir(1)".
+system. The default group is the root directory which, immediately
+after mounting, owns all the tasks and cpus in the system and can make
+full use of all resources.
+
+On a system with RDT control features additional directories can be
+created in the root directory that specify different amounts of each
+resource (see "schemata" below). The root and these additional top level
+directories are referred to as "CTRL_MON" groups below.
+
+On a system with RDT monitoring the root directory and other top level
+directories contain a directory named "mon_groups" in which additional
+directories can be created to monitor subsets of tasks in the CTRL_MON
+group that is their ancestor. These are called "MON" groups in the rest
+of this document.
+
+Removing a directory will move all tasks and cpus owned by the group it
+represents to the parent. Removing one of the created CTRL_MON groups
+will automatically remove all MON groups below it.
+
+All groups contain the following files:
+
+"tasks":
+ Reading this file shows the list of all tasks that belong to
+ this group. Writing a task id to the file will add a task to the
+ group. If the group is a CTRL_MON group the task is removed from
+ whichever previous CTRL_MON group owned the task and also from
+ any MON group that owned the task. If the group is a MON group,
+ then the task must already belong to the CTRL_MON parent of this
+ group. The task is removed from any previous MON group.
+
+
+"cpus":
+ Reading this file shows a bitmask of the logical CPUs owned by
+ this group. Writing a mask to this file will add and remove
+ CPUs to/from this group. As with the tasks file a hierarchy is
+ maintained where MON groups may only include CPUs owned by the
+ parent CTRL_MON group.
+
-There are three files associated with each group:
+"cpus_list":
+ Just like "cpus", only using ranges of CPUs instead of bitmasks.
-"tasks": A list of tasks that belongs to this group. Tasks can be
- added to a group by writing the task ID to the "tasks" file
- (which will automatically remove them from the previous
- group to which they belonged). New tasks created by fork(2)
- and clone(2) are added to the same group as their parent.
- If a pid is not in any sub partition, it is in root partition
- (i.e. default partition).
-"cpus": A bitmask of logical CPUs assigned to this group. Writing
- a new mask can add/remove CPUs from this group. Added CPUs
- are removed from their previous group. Removed ones are
- given to the default (root) group. You cannot remove CPUs
- from the default group.
+When control is enabled all CTRL_MON groups will also contain:
-"cpus_list": One or more CPU ranges of logical CPUs assigned to this
- group. Same rules apply like for the "cpus" file.
+"schemata":
+ A list of all the resources available to this group.
+ Each resource has its own line and format - see below for details.
-"schemata": A list of all the resources available to this group.
- Each resource has its own line and format - see below for
- details.
+When monitoring is enabled all MON groups will also contain:
-When a task is running the following rules define which resources
-are available to it:
+"mon_data":
+ This contains a set of files organized by L3 domain and by
+ RDT event. E.g. on a system with two L3 domains there will
+ be subdirectories "mon_L3_00" and "mon_L3_01". Each of these
+ directories have one file per event (e.g. "llc_occupancy",
+ "mbm_total_bytes", and "mbm_local_bytes"). In a MON group these
+ files provide a read out of the current value of the event for
+ all tasks in the group. In CTRL_MON groups these files provide
+ the sum for all tasks in the CTRL_MON group and all tasks in
+ MON groups. Please see example section for more details on usage.
+
+Resource allocation rules
+-------------------------
+When a task is running the following rules define which resources are
+available to it:
1) If the task is a member of a non-default group, then the schemata
-for that group is used.
+ for that group is used.
2) Else if the task belongs to the default group, but is running on a
-CPU that is assigned to some specific group, then the schemata for
-the CPU's group is used.
+ CPU that is assigned to some specific group, then the schemata for the
+ CPU's group is used.
3) Otherwise the schemata for the default group is used.
+Resource monitoring rules
+-------------------------
+1) If a task is a member of a MON group, or non-default CTRL_MON group
+ then RDT events for the task will be reported in that group.
+
+2) If a task is a member of the default CTRL_MON group, but is running
+ on a CPU that is assigned to some specific group, then the RDT events
+ for the task will be reported in that group.
+
+3) Otherwise RDT events for the task will be reported in the root level
+ "mon_data" group.
+
+
+Notes on cache occupancy monitoring and control
+-----------------------------------------------
+When moving a task from one group to another you should remember that
+this only affects *new* cache allocations by the task. E.g. you may have
+a task in a monitor group showing 3 MB of cache occupancy. If you move
+to a new group and immediately check the occupancy of the old and new
+groups you will likely see that the old group is still showing 3 MB and
+the new group zero. When the task accesses locations still in cache from
+before the move, the h/w does not update any counters. On a busy system
+you will likely see the occupancy in the old group go down as cache lines
+are evicted and re-used while the occupancy in the new group rises as
+the task accesses memory and loads into the cache are counted based on
+membership in the new group.
+
+The same applies to cache allocation control. Moving a task to a group
+with a smaller cache partition will not evict any cache lines. The
+process may continue to use them from the old partition.
+
+Hardware uses CLOSid(Class of service ID) and an RMID(Resource monitoring ID)
+to identify a control group and a monitoring group respectively. Each of
+the resource groups are mapped to these IDs based on the kind of group. The
+number of CLOSid and RMID are limited by the hardware and hence the creation of
+a "CTRL_MON" directory may fail if we run out of either CLOSID or RMID
+and creation of "MON" group may fail if we run out of RMIDs.
+
+max_threshold_occupancy - generic concepts
+------------------------------------------
+
+Note that an RMID once freed may not be immediately available for use as
+the RMID is still tagged the cache lines of the previous user of RMID.
+Hence such RMIDs are placed on limbo list and checked back if the cache
+occupancy has gone down. If there is a time when system has a lot of
+limbo RMIDs but which are not ready to be used, user may see an -EBUSY
+during mkdir.
+
+max_threshold_occupancy is a user configurable value to determine the
+occupancy at which an RMID can be freed.
Schemata files - general concepts
---------------------------------
@@ -143,22 +267,22 @@ SKUs. Using a high bandwidth and a low bandwidth setting on two threads
sharing a core will result in both threads being throttled to use the
low bandwidth.
-L3 details (code and data prioritization disabled)
---------------------------------------------------
+L3 schemata file details (code and data prioritization disabled)
+----------------------------------------------------------------
With CDP disabled the L3 schemata format is:
L3:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
-L3 details (CDP enabled via mount option to resctrl)
-----------------------------------------------------
+L3 schemata file details (CDP enabled via mount option to resctrl)
+------------------------------------------------------------------
When CDP is enabled L3 control is split into two separate resources
so you can specify independent masks for code and data like this:
L3data:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
L3code:<cache_id0>=<cbm>;<cache_id1>=<cbm>;...
-L2 details
-----------
+L2 schemata file details
+------------------------
L2 cache does not support code and data prioritization, so the
schemata format is always:
@@ -185,6 +309,8 @@ L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
L3DATA:0=fffff;1=fffff;2=3c0;3=fffff
L3CODE:0=fffff;1=fffff;2=fffff;3=fffff
+Examples for RDT allocation usage:
+
Example 1
---------
On a two socket machine (one L3 cache per socket) with just four bits
@@ -410,3 +536,124 @@ void main(void)
/* code to read and write directory contents */
resctrl_release_lock(fd);
}
+
+Examples for RDT Monitoring along with allocation usage:
+
+Reading monitored data
+----------------------
+Reading an event file (for ex: mon_data/mon_L3_00/llc_occupancy) would
+show the current snapshot of LLC occupancy of the corresponding MON
+group or CTRL_MON group.
+
+
+Example 1 (Monitor CTRL_MON group and subset of tasks in CTRL_MON group)
+---------
+On a two socket machine (one L3 cache per socket) with just four bits
+for cache bit masks
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p0 p1
+# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
+# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata
+# echo 5678 > p1/tasks
+# echo 5679 > p1/tasks
+
+The default resource group is unmodified, so we have access to all parts
+of all caches (its schemata file reads "L3:0=f;1=f").
+
+Tasks that are under the control of group "p0" may only allocate from the
+"lower" 50% on cache ID 0, and the "upper" 50% of cache ID 1.
+Tasks in group "p1" use the "lower" 50% of cache on both sockets.
+
+Create monitor groups and assign a subset of tasks to each monitor group.
+
+# cd /sys/fs/resctrl/p1/mon_groups
+# mkdir m11 m12
+# echo 5678 > m11/tasks
+# echo 5679 > m12/tasks
+
+fetch data (data shown in bytes)
+
+# cat m11/mon_data/mon_L3_00/llc_occupancy
+16234000
+# cat m11/mon_data/mon_L3_01/llc_occupancy
+14789000
+# cat m12/mon_data/mon_L3_00/llc_occupancy
+16789000
+
+The parent ctrl_mon group shows the aggregated data.
+
+# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
+31234000
+
+Example 2 (Monitor a task from its creation)
+---------
+On a two socket machine (one L3 cache per socket)
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p0 p1
+
+An RMID is allocated to the group once its created and hence the <cmd>
+below is monitored from its creation.
+
+# echo $$ > /sys/fs/resctrl/p1/tasks
+# <cmd>
+
+Fetch the data
+
+# cat /sys/fs/resctrl/p1/mon_data/mon_l3_00/llc_occupancy
+31789000
+
+Example 3 (Monitor without CAT support or before creating CAT groups)
+---------
+
+Assume a system like HSW has only CQM and no CAT support. In this case
+the resctrl will still mount but cannot create CTRL_MON directories.
+But user can create different MON groups within the root group thereby
+able to monitor all tasks including kernel threads.
+
+This can also be used to profile jobs cache size footprint before being
+able to allocate them to different allocation groups.
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir mon_groups/m01
+# mkdir mon_groups/m02
+
+# echo 3478 > /sys/fs/resctrl/mon_groups/m01/tasks
+# echo 2467 > /sys/fs/resctrl/mon_groups/m02/tasks
+
+Monitor the groups separately and also get per domain data. From the
+below its apparent that the tasks are mostly doing work on
+domain(socket) 0.
+
+# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_00/llc_occupancy
+31234000
+# cat /sys/fs/resctrl/mon_groups/m01/mon_L3_01/llc_occupancy
+34555
+# cat /sys/fs/resctrl/mon_groups/m02/mon_L3_00/llc_occupancy
+31234000
+# cat /sys/fs/resctrl/mon_groups/m02/mon_L3_01/llc_occupancy
+32789
+
+
+Example 4 (Monitor real time tasks)
+-----------------------------------
+
+A single socket system which has real time tasks running on cores 4-7
+and non real time tasks on other cpus. We want to monitor the cache
+occupancy of the real time threads on these cores.
+
+# mount -t resctrl resctrl /sys/fs/resctrl
+# cd /sys/fs/resctrl
+# mkdir p1
+
+Move the cpus 4-7 over to p1
+# echo f0 > p0/cpus
+
+View the llc occupancy snapshot
+
+# cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy
+11234000
diff --git a/Documentation/x86/microcode.txt b/Documentation/x86/microcode.txt
new file mode 100644
index 000000000000..f57e1b45e628
--- /dev/null
+++ b/Documentation/x86/microcode.txt
@@ -0,0 +1,137 @@
+ The Linux Microcode Loader
+
+Authors: Fenghua Yu <fenghua.yu@intel.com>
+ Borislav Petkov <bp@suse.de>
+
+The kernel has a x86 microcode loading facility which is supposed to
+provide microcode loading methods in the OS. Potential use cases are
+updating the microcode on platforms beyond the OEM End-Of-Life support,
+and updating the microcode on long-running systems without rebooting.
+
+The loader supports three loading methods:
+
+1. Early load microcode
+=======================
+
+The kernel can update microcode very early during boot. Loading
+microcode early can fix CPU issues before they are observed during
+kernel boot time.
+
+The microcode is stored in an initrd file. During boot, it is read from
+it and loaded into the CPU cores.
+
+The format of the combined initrd image is microcode in (uncompressed)
+cpio format followed by the (possibly compressed) initrd image. The
+loader parses the combined initrd image during boot.
+
+The microcode files in cpio name space are:
+
+on Intel: kernel/x86/microcode/GenuineIntel.bin
+on AMD : kernel/x86/microcode/AuthenticAMD.bin
+
+During BSP (BootStrapping Processor) boot (pre-SMP), the kernel
+scans the microcode file in the initrd. If microcode matching the
+CPU is found, it will be applied in the BSP and later on in all APs
+(Application Processors).
+
+The loader also saves the matching microcode for the CPU in memory.
+Thus, the cached microcode patch is applied when CPUs resume from a
+sleep state.
+
+Here's a crude example how to prepare an initrd with microcode (this is
+normally done automatically by the distribution, when recreating the
+initrd, so you don't really have to do it yourself. It is documented
+here for future reference only).
+
+---
+ #!/bin/bash
+
+ if [ -z "$1" ]; then
+ echo "You need to supply an initrd file"
+ exit 1
+ fi
+
+ INITRD="$1"
+
+ DSTDIR=kernel/x86/microcode
+ TMPDIR=/tmp/initrd
+
+ rm -rf $TMPDIR
+
+ mkdir $TMPDIR
+ cd $TMPDIR
+ mkdir -p $DSTDIR
+
+ if [ -d /lib/firmware/amd-ucode ]; then
+ cat /lib/firmware/amd-ucode/microcode_amd*.bin > $DSTDIR/AuthenticAMD.bin
+ fi
+
+ if [ -d /lib/firmware/intel-ucode ]; then
+ cat /lib/firmware/intel-ucode/* > $DSTDIR/GenuineIntel.bin
+ fi
+
+ find . | cpio -o -H newc >../ucode.cpio
+ cd ..
+ mv $INITRD $INITRD.orig
+ cat ucode.cpio $INITRD.orig > $INITRD
+
+ rm -rf $TMPDIR
+---
+
+The system needs to have the microcode packages installed into
+/lib/firmware or you need to fixup the paths above if yours are
+somewhere else and/or you've downloaded them directly from the processor
+vendor's site.
+
+2. Late loading
+===============
+
+There are two legacy user space interfaces to load microcode, either through
+/dev/cpu/microcode or through /sys/devices/system/cpu/microcode/reload file
+in sysfs.
+
+The /dev/cpu/microcode method is deprecated because it needs a special
+userspace tool for that.
+
+The easier method is simply installing the microcode packages your distro
+supplies and running:
+
+# echo 1 > /sys/devices/system/cpu/microcode/reload
+
+as root.
+
+The loading mechanism looks for microcode blobs in
+/lib/firmware/{intel-ucode,amd-ucode}. The default distro installation
+packages already put them there.
+
+3. Builtin microcode
+====================
+
+The loader supports also loading of a builtin microcode supplied through
+the regular firmware builtin method CONFIG_FIRMWARE_IN_KERNEL. Only
+64-bit is currently supported.
+
+Here's an example:
+
+CONFIG_FIRMWARE_IN_KERNEL=y
+CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09 amd-ucode/microcode_amd_fam15h.bin"
+CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware"
+
+This basically means, you have the following tree structure locally:
+
+/lib/firmware/
+|-- amd-ucode
+...
+| |-- microcode_amd_fam15h.bin
+...
+|-- intel-ucode
+...
+| |-- 06-3a-09
+...
+
+so that the build system can find those files and integrate them into
+the final kernel image. The early loader finds them and applies them.
+
+Needless to say, this method is not the most flexible one because it
+requires rebuilding the kernel each time updated microcode from the CPU
+vendor is available.
diff --git a/Documentation/x86/orc-unwinder.txt b/Documentation/x86/orc-unwinder.txt
new file mode 100644
index 000000000000..af0c9a4c65a6
--- /dev/null
+++ b/Documentation/x86/orc-unwinder.txt
@@ -0,0 +1,179 @@
+ORC unwinder
+============
+
+Overview
+--------
+
+The kernel CONFIG_ORC_UNWINDER option enables the ORC unwinder, which is
+similar in concept to a DWARF unwinder. The difference is that the
+format of the ORC data is much simpler than DWARF, which in turn allows
+the ORC unwinder to be much simpler and faster.
+
+The ORC data consists of unwind tables which are generated by objtool.
+They contain out-of-band data which is used by the in-kernel ORC
+unwinder. Objtool generates the ORC data by first doing compile-time
+stack metadata validation (CONFIG_STACK_VALIDATION). After analyzing
+all the code paths of a .o file, it determines information about the
+stack state at each instruction address in the file and outputs that
+information to the .orc_unwind and .orc_unwind_ip sections.
+
+The per-object ORC sections are combined at link time and are sorted and
+post-processed at boot time. The unwinder uses the resulting data to
+correlate instruction addresses with their stack states at run time.
+
+
+ORC vs frame pointers
+---------------------
+
+With frame pointers enabled, GCC adds instrumentation code to every
+function in the kernel. The kernel's .text size increases by about
+3.2%, resulting in a broad kernel-wide slowdown. Measurements by Mel
+Gorman [1] have shown a slowdown of 5-10% for some workloads.
+
+In contrast, the ORC unwinder has no effect on text size or runtime
+performance, because the debuginfo is out of band. So if you disable
+frame pointers and enable the ORC unwinder, you get a nice performance
+improvement across the board, and still have reliable stack traces.
+
+Ingo Molnar says:
+
+ "Note that it's not just a performance improvement, but also an
+ instruction cache locality improvement: 3.2% .text savings almost
+ directly transform into a similarly sized reduction in cache
+ footprint. That can transform to even higher speedups for workloads
+ whose cache locality is borderline."
+
+Another benefit of ORC compared to frame pointers is that it can
+reliably unwind across interrupts and exceptions. Frame pointer based
+unwinds can sometimes skip the caller of the interrupted function, if it
+was a leaf function or if the interrupt hit before the frame pointer was
+saved.
+
+The main disadvantage of the ORC unwinder compared to frame pointers is
+that it needs more memory to store the ORC unwind tables: roughly 2-4MB
+depending on the kernel config.
+
+
+ORC vs DWARF
+------------
+
+ORC debuginfo's advantage over DWARF itself is that it's much simpler.
+It gets rid of the complex DWARF CFI state machine and also gets rid of
+the tracking of unnecessary registers. This allows the unwinder to be
+much simpler, meaning fewer bugs, which is especially important for
+mission critical oops code.
+
+The simpler debuginfo format also enables the unwinder to be much faster
+than DWARF, which is important for perf and lockdep. In a basic
+performance test by Jiri Slaby [2], the ORC unwinder was about 20x
+faster than an out-of-tree DWARF unwinder. (Note: That measurement was
+taken before some performance tweaks were added, which doubled
+performance, so the speedup over DWARF may be closer to 40x.)
+
+The ORC data format does have a few downsides compared to DWARF. ORC
+unwind tables take up ~50% more RAM (+1.3MB on an x86 defconfig kernel)
+than DWARF-based eh_frame tables.
+
+Another potential downside is that, as GCC evolves, it's conceivable
+that the ORC data may end up being *too* simple to describe the state of
+the stack for certain optimizations. But IMO this is unlikely because
+GCC saves the frame pointer for any unusual stack adjustments it does,
+so I suspect we'll really only ever need to keep track of the stack
+pointer and the frame pointer between call frames. But even if we do
+end up having to track all the registers DWARF tracks, at least we will
+still be able to control the format, e.g. no complex state machines.
+
+
+ORC unwind table generation
+---------------------------
+
+The ORC data is generated by objtool. With the existing compile-time
+stack metadata validation feature, objtool already follows all code
+paths, and so it already has all the information it needs to be able to
+generate ORC data from scratch. So it's an easy step to go from stack
+validation to ORC data generation.
+
+It should be possible to instead generate the ORC data with a simple
+tool which converts DWARF to ORC data. However, such a solution would
+be incomplete due to the kernel's extensive use of asm, inline asm, and
+special sections like exception tables.
+
+That could be rectified by manually annotating those special code paths
+using GNU assembler .cfi annotations in .S files, and homegrown
+annotations for inline asm in .c files. But asm annotations were tried
+in the past and were found to be unmaintainable. They were often
+incorrect/incomplete and made the code harder to read and keep updated.
+And based on looking at glibc code, annotating inline asm in .c files
+might be even worse.
+
+Objtool still needs a few annotations, but only in code which does
+unusual things to the stack like entry code. And even then, far fewer
+annotations are needed than what DWARF would need, so they're much more
+maintainable than DWARF CFI annotations.
+
+So the advantages of using objtool to generate ORC data are that it
+gives more accurate debuginfo, with very few annotations. It also
+insulates the kernel from toolchain bugs which can be very painful to
+deal with in the kernel since we often have to workaround issues in
+older versions of the toolchain for years.
+
+The downside is that the unwinder now becomes dependent on objtool's
+ability to reverse engineer GCC code flow. If GCC optimizations become
+too complicated for objtool to follow, the ORC data generation might
+stop working or become incomplete. (It's worth noting that livepatch
+already has such a dependency on objtool's ability to follow GCC code
+flow.)
+
+If newer versions of GCC come up with some optimizations which break
+objtool, we may need to revisit the current implementation. Some
+possible solutions would be asking GCC to make the optimizations more
+palatable, or having objtool use DWARF as an additional input, or
+creating a GCC plugin to assist objtool with its analysis. But for now,
+objtool follows GCC code quite well.
+
+
+Unwinder implementation details
+-------------------------------
+
+Objtool generates the ORC data by integrating with the compile-time
+stack metadata validation feature, which is described in detail in
+tools/objtool/Documentation/stack-validation.txt. After analyzing all
+the code paths of a .o file, it creates an array of orc_entry structs,
+and a parallel array of instruction addresses associated with those
+structs, and writes them to the .orc_unwind and .orc_unwind_ip sections
+respectively.
+
+The ORC data is split into the two arrays for performance reasons, to
+make the searchable part of the data (.orc_unwind_ip) more compact. The
+arrays are sorted in parallel at boot time.
+
+Performance is further improved by the use of a fast lookup table which
+is created at runtime. The fast lookup table associates a given address
+with a range of indices for the .orc_unwind table, so that only a small
+subset of the table needs to be searched.
+
+
+Etymology
+---------
+
+Orcs, fearsome creatures of medieval folklore, are the Dwarves' natural
+enemies. Similarly, the ORC unwinder was created in opposition to the
+complexity and slowness of DWARF.
+
+"Although Orcs rarely consider multiple solutions to a problem, they do
+excel at getting things done because they are creatures of action, not
+thought." [3] Similarly, unlike the esoteric DWARF unwinder, the
+veracious ORC unwinder wastes no time or siloconic effort decoding
+variable-length zero-extended unsigned-integer byte-coded
+state-machine-based debug information entries.
+
+Similar to how Orcs frequently unravel the well-intentioned plans of
+their adversaries, the ORC unwinder frequently unravels stacks with
+brutal, unyielding efficiency.
+
+ORC stands for Oops Rewind Capability.
+
+
+[1] https://lkml.kernel.org/r/20170602104048.jkkzssljsompjdwy@suse.de
+[2] https://lkml.kernel.org/r/d2ca5435-6386-29b8-db87-7f227c2b713a@suse.cz
+[3] http://dustin.wikidot.com/half-orcs-and-orcs