diff options
258 files changed, 11242 insertions, 2893 deletions
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt index cd556b914786..68b6a6a470b0 100644 --- a/Documentation/cgroups/blkio-controller.txt +++ b/Documentation/cgroups/blkio-controller.txt @@ -387,8 +387,81 @@ groups and put applications in that group which are not driving enough IO to keep disk busy. In that case set group_idle=0, and CFQ will not idle on individual groups and throughput should improve. -What works -========== -- Currently only sync IO queues are support. All the buffered writes are - still system wide and not per group. Hence we will not see service - differentiation between buffered writes between groups. +Writeback +========= + +Page cache is dirtied through buffered writes and shared mmaps and +written asynchronously to the backing filesystem by the writeback +mechanism. Writeback sits between the memory and IO domains and +regulates the proportion of dirty memory by balancing dirtying and +write IOs. + +On traditional cgroup hierarchies, relationships between different +controllers cannot be established making it impossible for writeback +to operate accounting for cgroup resource restrictions and all +writeback IOs are attributed to the root cgroup. + +If both the blkio and memory controllers are used on the v2 hierarchy +and the filesystem supports cgroup writeback, writeback operations +correctly follow the resource restrictions imposed by both memory and +blkio controllers. + +Writeback examines both system-wide and per-cgroup dirty memory status +and enforces the more restrictive of the two. Also, writeback control +parameters which are absolute values - vm.dirty_bytes and +vm.dirty_background_bytes - are distributed across cgroups according +to their current writeback bandwidth. + +There's a peculiarity stemming from the discrepancy in ownership +granularity between memory controller and writeback. While memory +controller tracks ownership per page, writeback operates on inode +basis. cgroup writeback bridges the gap by tracking ownership by +inode but migrating ownership if too many foreign pages, pages which +don't match the current inode ownership, have been encountered while +writing back the inode. + +This is a conscious design choice as writeback operations are +inherently tied to inodes making strictly following page ownership +complicated and inefficient. The only use case which suffers from +this compromise is multiple cgroups concurrently dirtying disjoint +regions of the same inode, which is an unlikely use case and decided +to be unsupported. Note that as memory controller assigns page +ownership on the first use and doesn't update it until the page is +released, even if cgroup writeback strictly follows page ownership, +multiple cgroups dirtying overlapping areas wouldn't work as expected. +In general, write-sharing an inode across multiple cgroups is not well +supported. + +Filesystem support for cgroup writeback +--------------------------------------- + +A filesystem can make writeback IOs cgroup-aware by updating +address_space_operations->writepage[s]() to annotate bio's using the +following two functions. + +* wbc_init_bio(@wbc, @bio) + + Should be called for each bio carrying writeback data and associates + the bio with the inode's owner cgroup. Can be called anytime + between bio allocation and submission. + +* wbc_account_io(@wbc, @page, @bytes) + + Should be called for each data segment being written out. While + this function doesn't care exactly when it's called during the + writeback session, it's the easiest and most natural to call it as + data segments are added to a bio. + +With writeback bio's annotated, cgroup support can be enabled per +super_block by setting MS_CGROUPWB in ->s_flags. This allows for +selective disabling of cgroup writeback support which is helpful when +certain filesystem features, e.g. journaled data mode, are +incompatible. + +wbc_init_bio() binds the specified bio to its cgroup. Depending on +the configuration, the bio may be executed at a lower priority and if +the writeback session is holding shared resources, e.g. a journal +entry, may lead to priority inversion. There is no one easy solution +for the problem. Filesystems can try to work around specific problem +cases by skipping wbc_init_bio() or using bio_associate_blkcg() +directly. diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index f456b4315e86..ff71e16cc752 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -493,6 +493,7 @@ pgpgin - # of charging events to the memory cgroup. The charging pgpgout - # of uncharging events to the memory cgroup. The uncharging event happens each time a page is unaccounted from the cgroup. swap - # of bytes of swap usage +dirty - # of bytes that are waiting to get written back to the disk. writeback - # of bytes of file/anon cache that are queued for syncing to disk. inactive_anon - # of bytes of anonymous and swap cache memory on inactive diff --git a/Documentation/devicetree/bindings/arm/psci.txt b/Documentation/devicetree/bindings/arm/psci.txt index 5aa40ede0e99..a9adab84e2fe 100644 --- a/Documentation/devicetree/bindings/arm/psci.txt +++ b/Documentation/devicetree/bindings/arm/psci.txt @@ -31,6 +31,10 @@ Main node required properties: support, but are permitted to be present for compatibility with existing software when "arm,psci" is later in the compatible list. + * "arm,psci-1.0" : for implementations complying to PSCI 1.0. PSCI 1.0 is + backward compatible with PSCI 0.2 with minor specification updates, + as defined in the PSCI specification[2]. + - method : The method of calling the PSCI firmware. Permitted values are: @@ -100,3 +104,5 @@ Case 3: PSCI v0.2 and PSCI v0.1. [1] Kernel documentation - ARM idle states bindings Documentation/devicetree/bindings/arm/idle-states.txt +[2] Power State Coordination Interface (PSCI) specification + http://infocenter.arm.com/help/topic/com.arm.doc.den0022c/DEN0022C_Power_State_Coordination_Interface.pdf diff --git a/Documentation/devicetree/bindings/pci/arm,juno-r1-pcie.txt b/Documentation/devicetree/bindings/pci/arm,juno-r1-pcie.txt new file mode 100644 index 000000000000..f7514c170a32 --- /dev/null +++ b/Documentation/devicetree/bindings/pci/arm,juno-r1-pcie.txt @@ -0,0 +1,10 @@ +* ARM Juno R1 PCIe interface + +This PCIe host controller is based on PLDA XpressRICH3-AXI IP +and thus inherits all the common properties defined in plda,xpressrich3-axi.txt +as well as the base properties defined in host-generic-pci.txt. + +Required properties: + - compatible: "arm,juno-r1-pcie" + - dma-coherent: The host controller bridges the AXI transactions into PCIe bus + in a manner that makes the DMA operations to appear coherent to the CPUs. diff --git a/Documentation/devicetree/bindings/pci/plda,xpressrich3-axi.txt b/Documentation/devicetree/bindings/pci/plda,xpressrich3-axi.txt new file mode 100644 index 000000000000..f3f75bfb42bc --- /dev/null +++ b/Documentation/devicetree/bindings/pci/plda,xpressrich3-axi.txt @@ -0,0 +1,12 @@ +* PLDA XpressRICH3-AXI host controller + +The PLDA XpressRICH3-AXI host controller can be configured in a manner that +makes it compliant with the SBSA[1] standard published by ARM Ltd. For those +scenarios, the host-generic-pci.txt bindings apply with the following additions +to the compatible property: + +Required properties: + - compatible: should contain "plda,xpressrich3-axi" to identify the IP used. + + +[1] http://infocenter.arm.com/help/topic/com.arm.doc.den0029a/ diff --git a/Documentation/devicetree/bindings/pci/xgene-pci-msi.txt b/Documentation/devicetree/bindings/pci/xgene-pci-msi.txt new file mode 100644 index 000000000000..36d881c8e6d4 --- /dev/null +++ b/Documentation/devicetree/bindings/pci/xgene-pci-msi.txt @@ -0,0 +1,68 @@ +* AppliedMicro X-Gene v1 PCIe MSI controller + +Required properties: + +- compatible: should be "apm,xgene1-msi" to identify + X-Gene v1 PCIe MSI controller block. +- msi-controller: indicates that this is X-Gene v1 PCIe MSI controller node +- reg: physical base address (0x79000000) and length (0x900000) for controller + registers. These registers include the MSI termination address and data + registers as well as the MSI interrupt status registers. +- reg-names: not required +- interrupts: A list of 16 interrupt outputs of the controller, starting from + interrupt number 0x10 to 0x1f. +- interrupt-names: not required + +Each PCIe node needs to have property msi-parent that points to msi controller node + +Examples: + +SoC DTSI: + + + MSI node: + msi@79000000 { + compatible = "apm,xgene1-msi"; + msi-controller; + reg = <0x00 0x79000000 0x0 0x900000>; + interrupts = <0x0 0x10 0x4> + <0x0 0x11 0x4> + <0x0 0x12 0x4> + <0x0 0x13 0x4> + <0x0 0x14 0x4> + <0x0 0x15 0x4> + <0x0 0x16 0x4> + <0x0 0x17 0x4> + <0x0 0x18 0x4> + <0x0 0x19 0x4> + <0x0 0x1a 0x4> + <0x0 0x1b 0x4> + <0x0 0x1c 0x4> + <0x0 0x1d 0x4> + <0x0 0x1e 0x4> + <0x0 0x1f 0x4>; + }; + + + PCIe controller node with msi-parent property pointing to MSI node: + pcie0: pcie@1f2b0000 { + status = "disabled"; + device_type = "pci"; + compatible = "apm,xgene-storm-pcie", "apm,xgene-pcie"; + #interrupt-cells = <1>; + #size-cells = <2>; + #address-cells = <3>; + reg = < 0x00 0x1f2b0000 0x0 0x00010000 /* Controller registers */ + 0xe0 0xd0000000 0x0 0x00040000>; /* PCI config space */ + reg-names = "csr", "cfg"; + ranges = <0x01000000 0x00 0x00000000 0xe0 0x10000000 0x00 0x00010000 /* io */ + 0x02000000 0x00 0x80000000 0xe1 0x80000000 0x00 0x80000000>; /* mem */ + dma-ranges = <0x42000000 0x80 0x00000000 0x80 0x00000000 0x00 0x80000000 + 0x42000000 0x00 0x00000000 0x00 0x00000000 0x80 0x00000000>; + interrupt-map-mask = <0x0 0x0 0x0 0x7>; + interrupt-map = <0x0 0x0 0x0 0x1 &gic 0x0 0xc2 0x1 + 0x0 0x0 0x0 0x2 &gic 0x0 0xc3 0x1 + 0x0 0x0 0x0 0x3 &gic 0x0 0xc4 0x1 + 0x0 0x0 0x0 0x4 &gic 0x0 0xc5 0x1>; + dma-coherent; + clocks = <&pcie0clk 0>; + msi-parent= <&msi>; + }; diff --git a/Documentation/devicetree/bindings/thermal/thermal.txt b/Documentation/devicetree/bindings/thermal/thermal.txt index 29fe0bfae38e..8a49362dea6e 100644 --- a/Documentation/devicetree/bindings/thermal/thermal.txt +++ b/Documentation/devicetree/bindings/thermal/thermal.txt @@ -167,6 +167,13 @@ Optional property: by means of sensor ID. Additional coefficients are interpreted as constant offset. +- sustainable-power: An estimate of the sustainable power (in mW) that the + Type: unsigned thermal zone can dissipate at the desired + Size: one cell control temperature. For reference, the + sustainable power of a 4'' phone is typically + 2000mW, while on a 10'' tablet is around + 4500mW. + Note: The delay properties are bound to the maximum dT/dt (temperature derivative over time) in two situations for a thermal zone: (i) - when passive cooling is activated (polling-delay-passive); and @@ -546,6 +553,8 @@ thermal-zones { */ coefficients = <1200 -345 890>; + sustainable-power = <2500>; + trips { /* Trips are based on resulting linear equation */ cpu_trip: cpu-trip { diff --git a/Documentation/devicetree/bindings/vendor-prefixes.txt b/Documentation/devicetree/bindings/vendor-prefixes.txt index 80339192c93e..e2b67808fdcd 100644 --- a/Documentation/devicetree/bindings/vendor-prefixes.txt +++ b/Documentation/devicetree/bindings/vendor-prefixes.txt @@ -149,6 +149,7 @@ pericom Pericom Technology Inc. phytec PHYTEC Messtechnik GmbH picochip Picochip Ltd plathome Plat'Home Co., Ltd. +plda PLDA pixcir PIXCIR MICROELECTRONICS Co., Ltd powervr PowerVR (deprecated, use img) qca Qualcomm Atheros, Inc. diff --git a/Documentation/thermal/cpu-cooling-api.txt b/Documentation/thermal/cpu-cooling-api.txt index 753e47cc2e20..71653584cd03 100644 --- a/Documentation/thermal/cpu-cooling-api.txt +++ b/Documentation/thermal/cpu-cooling-api.txt @@ -36,8 +36,162 @@ the user. The registration APIs returns the cooling device pointer. np: pointer to the cooling device device tree node clip_cpus: cpumask of cpus where the frequency constraints will happen. -1.1.3 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev) +1.1.3 struct thermal_cooling_device *cpufreq_power_cooling_register( + const struct cpumask *clip_cpus, u32 capacitance, + get_static_t plat_static_func) + +Similar to cpufreq_cooling_register, this function registers a cpufreq +cooling device. Using this function, the cooling device will +implement the power extensions by using a simple cpu power model. The +cpus must have registered their OPPs using the OPP library. + +The additional parameters are needed for the power model (See 2. Power +models). "capacitance" is the dynamic power coefficient (See 2.1 +Dynamic power). "plat_static_func" is a function to calculate the +static power consumed by these cpus (See 2.2 Static power). + +1.1.4 struct thermal_cooling_device *of_cpufreq_power_cooling_register( + struct device_node *np, const struct cpumask *clip_cpus, u32 capacitance, + get_static_t plat_static_func) + +Similar to cpufreq_power_cooling_register, this function register a +cpufreq cooling device with power extensions using the device tree +information supplied by the np parameter. + +1.1.5 void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev) This interface function unregisters the "thermal-cpufreq-%x" cooling device. cdev: Cooling device pointer which has to be unregistered. + +2. Power models + +The power API registration functions provide a simple power model for +CPUs. The current power is calculated as dynamic + (optionally) +static power. This power model requires that the operating-points of +the CPUs are registered using the kernel's opp library and the +`cpufreq_frequency_table` is assigned to the `struct device` of the +cpu. If you are using CONFIG_CPUFREQ_DT then the +`cpufreq_frequency_table` should already be assigned to the cpu +device. + +The `plat_static_func` parameter of `cpufreq_power_cooling_register()` +and `of_cpufreq_power_cooling_register()` is optional. If you don't +provide it, only dynamic power will be considered. + +2.1 Dynamic power + +The dynamic power consumption of a processor depends on many factors. +For a given processor implementation the primary factors are: + +- The time the processor spends running, consuming dynamic power, as + compared to the time in idle states where dynamic consumption is + negligible. Herein we refer to this as 'utilisation'. +- The voltage and frequency levels as a result of DVFS. The DVFS + level is a dominant factor governing power consumption. +- In running time the 'execution' behaviour (instruction types, memory + access patterns and so forth) causes, in most cases, a second order + variation. In pathological cases this variation can be significant, + but typically it is of a much lesser impact than the factors above. + +A high level dynamic power consumption model may then be represented as: + +Pdyn = f(run) * Voltage^2 * Frequency * Utilisation + +f(run) here represents the described execution behaviour and its +result has a units of Watts/Hz/Volt^2 (this often expressed in +mW/MHz/uVolt^2) + +The detailed behaviour for f(run) could be modelled on-line. However, +in practice, such an on-line model has dependencies on a number of +implementation specific processor support and characterisation +factors. Therefore, in initial implementation that contribution is +represented as a constant coefficient. This is a simplification +consistent with the relative contribution to overall power variation. + +In this simplified representation our model becomes: + +Pdyn = Capacitance * Voltage^2 * Frequency * Utilisation + +Where `capacitance` is a constant that represents an indicative +running time dynamic power coefficient in fundamental units of +mW/MHz/uVolt^2. Typical values for mobile CPUs might lie in range +from 100 to 500. For reference, the approximate values for the SoC in +ARM's Juno Development Platform are 530 for the Cortex-A57 cluster and +140 for the Cortex-A53 cluster. + + +2.2 Static power + +Static leakage power consumption depends on a number of factors. For a +given circuit implementation the primary factors are: + +- Time the circuit spends in each 'power state' +- Temperature +- Operating voltage +- Process grade + +The time the circuit spends in each 'power state' for a given +evaluation period at first order means OFF or ON. However, +'retention' states can also be supported that reduce power during +inactive periods without loss of context. + +Note: The visibility of state entries to the OS can vary, according to +platform specifics, and this can then impact the accuracy of a model +based on OS state information alone. It might be possible in some +cases to extract more accurate information from system resources. + +The temperature, operating voltage and process 'grade' (slow to fast) +of the circuit are all significant factors in static leakage power +consumption. All of these have complex relationships to static power. + +Circuit implementation specific factors include the chosen silicon +process as well as the type, number and size of transistors in both +the logic gates and any RAM elements included. + +The static power consumption modelling must take into account the +power managed regions that are implemented. Taking the example of an +ARM processor cluster, the modelling would take into account whether +each CPU can be powered OFF separately or if only a single power +region is implemented for the complete cluster. + +In one view, there are others, a static power consumption model can +then start from a set of reference values for each power managed +region (e.g. CPU, Cluster/L2) in each state (e.g. ON, OFF) at an +arbitrary process grade, voltage and temperature point. These values +are then scaled for all of the following: the time in each state, the +process grade, the current temperature and the operating voltage. +However, since both implementation specific and complex relationships +dominate the estimate, the appropriate interface to the model from the +cpu cooling device is to provide a function callback that calculates +the static power in this platform. When registering the cpu cooling +device pass a function pointer that follows the `get_static_t` +prototype: + + int plat_get_static(cpumask_t *cpumask, int interval, + unsigned long voltage, u32 &power); + +`cpumask` is the cpumask of the cpus involved in the calculation. +`voltage` is the voltage at which they are operating. The function +should calculate the average static power for the last `interval` +milliseconds. It returns 0 on success, -E* on error. If it +succeeds, it should store the static power in `power`. Reading the +temperature of the cpus described by `cpumask` is left for +plat_get_static() to do as the platform knows best which thermal +sensor is closest to the cpu. + +If `plat_static_func` is NULL, static power is considered to be +negligible for this platform and only dynamic power is considered. + +The platform specific callback can then use any combination of tables +and/or equations to permute the estimated value. Process grade +information is not passed to the model since access to such data, from +on-chip measurement capability or manufacture time data, is platform +specific. + +Note: the significance of static power for CPUs in comparison to +dynamic power is highly dependent on implementation. Given the +potential complexity in implementation, the importance and accuracy of +its inclusion when using cpu cooling devices should be assessed on a +case by case basis. + diff --git a/Documentation/thermal/power_allocator.txt b/Documentation/thermal/power_allocator.txt new file mode 100644 index 000000000000..c3797b529991 --- /dev/null +++ b/Documentation/thermal/power_allocator.txt @@ -0,0 +1,247 @@ +Power allocator governor tunables +================================= + +Trip points +----------- + +The governor requires the following two passive trip points: + +1. "switch on" trip point: temperature above which the governor + control loop starts operating. This is the first passive trip + point of the thermal zone. + +2. "desired temperature" trip point: it should be higher than the + "switch on" trip point. This the target temperature the governor + is controlling for. This is the last passive trip point of the + thermal zone. + +PID Controller +-------------- + +The power allocator governor implements a +Proportional-Integral-Derivative controller (PID controller) with +temperature as the control input and power as the controlled output: + + P_max = k_p * e + k_i * err_integral + k_d * diff_err + sustainable_power + +where + e = desired_temperature - current_temperature + err_integral is the sum of previous errors + diff_err = e - previous_error + +It is similar to the one depicted below: + + k_d + | +current_temp | + | v + | +----------+ +---+ + | +----->| diff_err |-->| X |------+ + | | +----------+ +---+ | + | | | tdp actor + | | k_i | | get_requested_power() + | | | | | | | + | | | | | | | ... + v | v v v v v + +---+ | +-------+ +---+ +---+ +---+ +----------+ + | S |-------+----->| sum e |----->| X |--->| S |-->| S |-->|power | + +---+ | +-------+ +---+ +---+ +---+ |allocation| + ^ | ^ +----------+ + | | | | | + | | +---+ | | | + | +------->| X |-------------------+ v v + | +---+ granted performance +desired_temperature ^ + | + | + k_po/k_pu + +Sustainable power +----------------- + +An estimate of the sustainable dissipatable power (in mW) should be +provided while registering the thermal zone. This estimates the +sustained power that can be dissipated at the desired control +temperature. This is the maximum sustained power for allocation at +the desired maximum temperature. The actual sustained power can vary +for a number of reasons. The closed loop controller will take care of +variations such as environmental conditions, and some factors related +to the speed-grade of the silicon. `sustainable_power` is therefore +simply an estimate, and may be tuned to affect the aggressiveness of +the thermal ramp. For reference, the sustainable power of a 4" phone +is typically 2000mW, while on a 10" tablet is around 4500mW (may vary +depending on screen size). + +If you are using device tree, do add it as a property of the +thermal-zone. For example: + + thermal-zones { + soc_thermal { + polling-delay = <1000>; + polling-delay-passive = <100>; + sustainable-power = <2500>; + ... + +Instead, if the thermal zone is registered from the platform code, pass a +`thermal_zone_params` that has a `sustainable_power`. If no +`thermal_zone_params` were being passed, then something like below +will suffice: + + static const struct thermal_zone_params tz_params = { + .sustainable_power = 3500, + }; + +and then pass `tz_params` as the 5th parameter to +`thermal_zone_device_register()` + +k_po and k_pu +------------- + +The implementation of the PID controller in the power allocator +thermal governor allows the configuration of two proportional term +constants: `k_po` and `k_pu`. `k_po` is the proportional term +constant during temperature overshoot periods (current temperature is +above "desired temperature" trip point). Conversely, `k_pu` is the +proportional term constant during temperature undershoot periods +(current temperature below "desired temperature" trip point). + +These controls are intended as the primary mechanism for configuring +the permitted thermal "ramp" of the system. For instance, a lower +`k_pu` value will provide a slower ramp, at the cost of capping +available capacity at a low temperature. On the other hand, a high +value of `k_pu` will result in the governor granting very high power +whilst temperature is low, and may lead to temperature overshooting. + +The default value for `k_pu` is: + + 2 * sustainable_power / (desired_temperature - switch_on_temp) + +This means that at `switch_on_temp` the output of the controller's +proportional term will be 2 * `sustainable_power`. The default value +for `k_po` is: + + sustainable_power / (desired_temperature - switch_on_temp) + +Focusing on the proportional and feed forward values of the PID +controller equation we have: + + P_max = k_p * e + sustainable_power + +The proportional term is proportional to the difference between the +desired temperature and the current one. When the current temperature +is the desired one, then the proportional component is zero and +`P_max` = `sustainable_power`. That is, the system should operate in +thermal equilibrium under constant load. `sustainable_power` is only +an estimate, which is the reason for closed-loop control such as this. + +Expanding `k_pu` we get: + P_max = 2 * sustainable_power * (T_set - T) / (T_set - T_on) + + sustainable_power + +where + T_set is the desired temperature + T is the current temperature + T_on is the switch on temperature + +When the current temperature is the switch_on temperature, the above +formula becomes: + + P_max = 2 * sustainable_power * (T_set - T_on) / (T_set - T_on) + + sustainable_power = 2 * sustainable_power + sustainable_power = + 3 * sustainable_power + +Therefore, the proportional term alone linearly decreases power from +3 * `sustainable_power` to `sustainable_power` as the temperature +rises from the switch on temperature to the desired temperature. + +k_i and integral_cutoff +----------------------- + +`k_i` configures the PID loop's integral term constant. This term +allows the PID controller to compensate for long term drift and for +the quantized nature of the output control: cooling devices can't set +the exact power that the governor requests. When the temperature +error is below `integral_cutoff`, errors are accumulated in the +integral term. This term is then multiplied by `k_i` and the result +added to the output of the controller. Typically `k_i` is set low (1 +or 2) and `integral_cutoff` is 0. + +k_d +--- + +`k_d` configures the PID loop's derivative term constant. It's +recommended to leave it as the default: 0. + +Cooling device power API +======================== + +Cooling devices controlled by this governor must supply the additional +"power" API in their `cooling_device_ops`. It consists on three ops: + +1. int get_requested_power(struct thermal_cooling_device *cdev, + struct thermal_zone_device *tz, u32 *power); +@cdev: The `struct thermal_cooling_device` pointer +@tz: thermal zone in which we are currently operating +@power: pointer in which to store the calculated power + +`get_requested_power()` calculates the power requested by the device +in milliwatts and stores it in @power . It should return 0 on +success, -E* on failure. This is currently used by the power +allocator governor to calculate how much power to give to each cooling +device. + +2. int state2power(struct thermal_cooling_device *cdev, struct + thermal_zone_device *tz, unsigned long state, u32 *power); +@cdev: The `struct thermal_cooling_device` pointer +@tz: thermal zone in which we are currently operating +@state: A cooling device state +@power: pointer in which to store the equivalent power + +Convert cooling device state @state into power consumption in +milliwatts and store it in @power. It should return 0 on success, -E* +on failure. This is currently used by thermal core to calculate the +maximum power that an actor can consume. + +3. int power2state(struct thermal_cooling_device *cdev, u32 power, + unsigned long *state); +@cdev: The `struct thermal_cooling_device` pointer +@power: power in milliwatts +@state: pointer in which to store the resulting state + +Calculate a cooling device state that would make the device consume at +most @power mW and store it in @state. It should return 0 on success, +-E* on failure. This is currently used by the thermal core to convert +a given power set by the power allocator governor to a state that the +cooling device can set. It is a function because this conversion may +depend on external factors that may change so this function should the +best conversion given "current circumstances". + +Cooling device weights +---------------------- + +Weights are a mechanism to bias the allocation among cooling +devices. They express the relative power efficiency of different +cooling devices. Higher weight can be used to express higher power +efficiency. Weighting is relative such that if each cooling device +has a weight of one they are considered equal. This is particularly +useful in heterogeneous systems where two cooling devices may perform +the same kind of compute, but with different efficiency. For example, +a system with two different types of processors. + +If the thermal zone is registered using +`thermal_zone_device_register()` (i.e., platform code), then weights +are passed as part of the thermal zone's `thermal_bind_parameters`. +If the platform is registered using device tree, then they are passed +as the `contribution` property of each map in the `cooling-maps` node. + +Limitations of the power allocator governor +=========================================== + +The power allocator governor's PID controller works best if there is a +periodic tick. If you have a driver that calls +`thermal_zone_device_update()` (or anything that ends up calling the +governor's `throttle()` function) repetitively, the governor response +won't be very good. Note that this is not particular to this +governor, step-wise will also misbehave if you call its throttle() +faster than the normal thermal framework tick (due to interrupts for +example) as it will overreact. diff --git a/Documentation/thermal/sysfs-api.txt b/Documentation/thermal/sysfs-api.txt index 87519cb379ee..c1f6864a8c5d 100644 --- a/Documentation/thermal/sysfs-api.txt +++ b/Documentation/thermal/sysfs-api.txt @@ -95,7 +95,7 @@ temperature) and throttle appropriate devices. 1.3 interface for binding a thermal zone device with a thermal cooling device 1.3.1 int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz, int trip, struct thermal_cooling_device *cdev, - unsigned long upper, unsigned long lower); + unsigned long upper, unsigned long lower, unsigned int weight); This interface function bind a thermal cooling device to the certain trip point of a thermal zone device. @@ -110,6 +110,8 @@ temperature) and throttle appropriate devices. lower:the Minimum cooling state can be used for this trip point. THERMAL_NO_LIMIT means no lower limit, and the cooling device can be in cooling state 0. + weight: the influence of this cooling device in this thermal + zone. See 1.4.1 below for more information. 1.3.2 int thermal_zone_unbind_cooling_device(struct thermal_zone_device *tz, int trip, struct thermal_cooling_device *cdev); @@ -127,9 +129,15 @@ temperature) and throttle appropriate devices. This structure defines the following parameters that are used to bind a zone with a cooling device for a particular trip point. .cdev: The cooling device pointer - .weight: The 'influence' of a particular cooling device on this zone. - This is on a percentage scale. The sum of all these weights - (for a particular zone) cannot exceed 100. + .weight: The 'influence' of a particular cooling device on this + zone. This is relative to the rest of the cooling + devices. For example, if all cooling devices have a + weight of 1, then they all contribute the same. You can + use percentages if you want, but it's not mandatory. A + weight of 0 means that this cooling device doesn't + contribute to the cooling of this zone unless all cooling + devices have a weight of 0. If all weights are 0, then + they all contribute the same. .trip_mask:This is a bit mask that gives the binding relation between this thermal zone and cdev, for a particular trip point. If nth bit is set, then the cdev and thermal zone are bound @@ -176,6 +184,14 @@ Thermal zone device sys I/F, created once it's registered: |---trip_point_[0-*]_type: Trip point type |---trip_point_[0-*]_hyst: Hysteresis value for this trip point |---emul_temp: Emulated temperature set node + |---sustainable_power: Sustainable dissipatable power + |---k_po: Proportional term during temperature overshoot + |---k_pu: Proportional term during temperature undershoot + |---k_i: PID's integral term in the power allocator gov + |---k_d: PID's derivative term in the power allocator + |---integral_cutoff: Offset above which errors are accumulated + |---slope: Slope constant applied as linear extrapolation + |---offset: Offset constant applied as linear extrapolation Thermal cooling device sys I/F, created once it's registered: /sys/class/thermal/cooling_device[0-*]: @@ -192,6 +208,8 @@ thermal_zone_bind_cooling_device/thermal_zone_unbind_cooling_device. /sys/class/thermal/thermal_zone[0-*]: |---cdev[0-*]: [0-*]th cooling device in current thermal zone |---cdev[0-*]_trip_point: Trip point that cdev[0-*] is associated with + |---cdev[0-*]_weight: Influence of the cooling device in + this thermal zone Besides the thermal zone device sysfs I/F and cooling device sysfs I/F, the generic thermal driver also creates a hwmon sysfs I/F for each _type_ @@ -265,6 +283,14 @@ cdev[0-*]_trip_point point. RO, Optional +cdev[0-*]_weight + The influence of cdev[0-*] in this thermal zone. This value + is relative to the rest of cooling devices in the thermal + zone. For example, if a cooling device has a weight double + than that of other, it's twice as effective in cooling the + thermal zone. + RW, Optional + passive Attribute is only present for zones in which the passive cooling policy is not supported by native thermal driver. Default is zero @@ -289,6 +315,66 @@ emul_temp because userland can easily disable the thermal policy by simply flooding this sysfs node with low temperature values. +sustainable_power + An estimate of the sustained power that can be dissipated by + the thermal zone. Used by the power allocator governor. For + more information see Documentation/thermal/power_allocator.txt + Unit: milliwatts + RW, Optional + +k_po + The proportional term of the power allocator governor's PID + controller during temperature overshoot. Temperature overshoot + is when the current temperature is above the "desired + temperature" trip point. For more information see + Documentation/thermal/power_allocator.txt + RW, Optional + +k_pu + The proportional term of the power allocator governor's PID + controller during temperature undershoot. Temperature undershoot + is when the current temperature is below the "desired + temperature" trip point. For more information see + Documentation/thermal/power_allocator.txt + RW, Optional + +k_i + The integral term of the power allocator governor's PID + controller. This term allows the PID controller to compensate + for long term drift. For more information see + Documentation/thermal/power_allocator.txt + RW, Optional + +k_d + The derivative term of the power allocator governor's PID + controller. For more information see + Documentation/thermal/power_allocator.txt + RW, Optional + +integral_cutoff + Temperature offset from the desired temperature trip point + above which the integral term of the power allocator + governor's PID controller starts accumulating errors. For + example, if integral_cutoff is 0, then the integral term only + accumulates error when temperature is above the desired + temperature trip point. For more information see + Documentation/thermal/power_allocator.txt + RW, Optional + +slope + The slope constant used in a linear extrapolation model + to determine a hotspot temperature based off the sensor's + raw readings. It is up to the device driver to determine + the usage of these values. + RW, Optional + +offset + The offset constant used in a linear extrapolation model + to determine a hotspot temperature based off the sensor's + raw readings. It is up to the device driver to determine + the usage of these values. + RW, Optional + ***************************** * Cooling device attributes * ***************************** @@ -318,7 +404,8 @@ passive, active. If an ACPI thermal zone supports critical, passive, active[0] and active[1] at the same time, it may register itself as a thermal_zone_device (thermal_zone1) with 4 trip points in all. It has one processor and one fan, which are both registered as -thermal_cooling_device. +thermal_cooling_device. Both are considered to have the same +effectiveness in cooling the thermal zone. If the processor is listed in _PSL method, and the fan is listed in _AL0 method, the sys I/F structure will be built like this: @@ -340,8 +427,10 @@ method, the sys I/F structure will be built like this: |---trip_point_3_type: active1 |---cdev0: --->/sys/class/thermal/cooling_device0 |---cdev0_trip_point: 1 /* cdev0 can be used for passive */ + |---cdev0_weight: 1024 |---cdev1: --->/sys/class/thermal/cooling_device3 |---cdev1_trip_point: 2 /* cdev1 can be used for active[0]*/ + |---cdev1_weight: 1024 |cooling_device0: |---type: Processor diff --git a/MAINTAINERS b/MAINTAINERS index d8afd2953678..00411d20525a 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -7594,6 +7594,14 @@ L: linux-pci@vger.kernel.org S: Maintained F: drivers/pci/host/*spear* +PCI MSI DRIVER FOR APPLIEDMICRO XGENE +M: Duc Dang <dhdang@apm.com> +L: linux-pci@vger.kernel.org +L: linux-arm-kernel@lists.infradead.org +S: Maintained +F: Documentation/devicetree/bindings/pci/xgene-pci-msi.txt +F: drivers/pci/host/pci-xgene-msi.c + PCMCIA SUBSYSTEM P: Linux PCMCIA Team L: linux-pcmcia@lists.infradead.org diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig index 19f4cc634b0e..3a9d691afe58 100644 --- a/arch/arm/Kconfig +++ b/arch/arm/Kconfig @@ -1466,6 +1466,7 @@ config HOTPLUG_CPU config ARM_PSCI bool "Support for the ARM Power State Coordination Interface (PSCI)" depends on CPU_V7 + select ARM_PSCI_FW help Say Y here if you want Linux to communicate with system firmware implementing the PSCI specification for CPU-centric power diff --git a/arch/arm/include/asm/psci.h b/arch/arm/include/asm/psci.h index c25ef3ec6d1f..68ee3ce17b82 100644 --- a/arch/arm/include/asm/psci.h +++ b/arch/arm/include/asm/psci.h @@ -14,34 +14,11 @@ #ifndef __ASM_ARM_PSCI_H #define __ASM_ARM_PSCI_H -#define PSCI_POWER_STATE_TYPE_STANDBY 0 -#define PSCI_POWER_STATE_TYPE_POWER_DOWN 1 - -struct psci_power_state { - u16 id; - u8 type; - u8 affinity_level; -}; - -struct psci_operations { - int (*cpu_suspend)(struct psci_power_state state, - unsigned long entry_point); - int (*cpu_off)(struct psci_power_state state); - int (*cpu_on)(unsigned long cpuid, unsigned long entry_point); - int (*migrate)(unsigned long cpuid); - int (*affinity_info)(unsigned long target_affinity, - unsigned long lowest_affinity_level); - int (*migrate_info_type)(void); -}; - -extern struct psci_operations psci_ops; extern struct smp_operations psci_smp_ops; #ifdef CONFIG_ARM_PSCI -int psci_init(void); bool psci_smp_available(void); #else -static inline int psci_init(void) { return 0; } static inline bool psci_smp_available(void) { return false; } #endif diff --git a/arch/arm/kernel/Makefile b/arch/arm/kernel/Makefile index 752725dcbf42..2c06383f4d47 100644 --- a/arch/arm/kernel/Makefile +++ b/arch/arm/kernel/Makefile @@ -86,7 +86,7 @@ obj-$(CONFIG_EARLY_PRINTK) += early_printk.o obj-$(CONFIG_ARM_VIRT_EXT) += hyp-stub.o ifeq ($(CONFIG_ARM_PSCI),y) -obj-y += psci.o psci-call.o +obj-y += psci-call.o obj-$(CONFIG_SMP) += psci_smp.o endif diff --git a/arch/arm/kernel/psci.c b/arch/arm/kernel/psci.c deleted file mode 100644 index f90fdf4ce7c7..000000000000 --- a/arch/arm/kernel/psci.c +++ /dev/null @@ -1,299 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * Copyright (C) 2012 ARM Limited - * - * Author: Will Deacon <will.deacon@arm.com> - */ - -#define pr_fmt(fmt) "psci: " fmt - -#include <linux/init.h> -#include <linux/of.h> -#include <linux/reboot.h> -#include <linux/pm.h> -#include <uapi/linux/psci.h> - -#include <asm/compiler.h> -#include <asm/errno.h> -#include <asm/psci.h> -#include <asm/system_misc.h> - -struct psci_operations psci_ops; - -static int (*invoke_psci_fn)(u32, u32, u32, u32); -typedef int (*psci_initcall_t)(const struct device_node *); - -asmlinkage int __invoke_psci_fn_hvc(u32, u32, u32, u32); -asmlinkage int __invoke_psci_fn_smc(u32, u32, u32, u32); - -enum psci_function { - PSCI_FN_CPU_SUSPEND, - PSCI_FN_CPU_ON, - PSCI_FN_CPU_OFF, - PSCI_FN_MIGRATE, - PSCI_FN_AFFINITY_INFO, - PSCI_FN_MIGRATE_INFO_TYPE, - PSCI_FN_MAX, -}; - -static u32 psci_function_id[PSCI_FN_MAX]; - -static int psci_to_linux_errno(int errno) -{ - switch (errno) { - case PSCI_RET_SUCCESS: - return 0; - case PSCI_RET_NOT_SUPPORTED: - return -EOPNOTSUPP; - case PSCI_RET_INVALID_PARAMS: - return -EINVAL; - case PSCI_RET_DENIED: - return -EPERM; - }; - - return -EINVAL; -} - -static u32 psci_power_state_pack(struct psci_power_state state) -{ - return ((state.id << PSCI_0_2_POWER_STATE_ID_SHIFT) - & PSCI_0_2_POWER_STATE_ID_MASK) | - ((state.type << PSCI_0_2_POWER_STATE_TYPE_SHIFT) - & PSCI_0_2_POWER_STATE_TYPE_MASK) | - ((state.affinity_level << PSCI_0_2_POWER_STATE_AFFL_SHIFT) - & PSCI_0_2_POWER_STATE_AFFL_MASK); -} - -static int psci_get_version(void) -{ - int err; - - err = invoke_psci_fn(PSCI_0_2_FN_PSCI_VERSION, 0, 0, 0); - return err; -} - -static int psci_cpu_suspend(struct psci_power_state state, - unsigned long entry_point) -{ - int err; - u32 fn, power_state; - - fn = psci_function_id[PSCI_FN_CPU_SUSPEND]; - power_state = psci_power_state_pack(state); - err = invoke_psci_fn(fn, power_state, entry_point, 0); - return psci_to_linux_errno(err); -} - -static int psci_cpu_off(struct psci_power_state state) -{ - int err; - u32 fn, power_state; - - fn = psci_function_id[PSCI_FN_CPU_OFF]; - power_state = psci_power_state_pack(state); - err = invoke_psci_fn(fn, power_state, 0, 0); - return psci_to_linux_errno(err); -} - -static int psci_cpu_on(unsigned long cpuid, unsigned long entry_point) -{ - int err; - u32 fn; - - fn = psci_function_id[PSCI_FN_CPU_ON]; - err = invoke_psci_fn(fn, cpuid, entry_point, 0); - return psci_to_linux_errno(err); -} - -static int psci_migrate(unsigned long cpuid) -{ - int err; - u32 fn; - - fn = psci_function_id[PSCI_FN_MIGRATE]; - err = invoke_psci_fn(fn, cpuid, 0, 0); - return psci_to_linux_errno(err); -} - -static int psci_affinity_info(unsigned long target_affinity, - unsigned long lowest_affinity_level) -{ - int err; - u32 fn; - - fn = psci_function_id[PSCI_FN_AFFINITY_INFO]; - err = invoke_psci_fn(fn, target_affinity, lowest_affinity_level, 0); - return err; -} - -static int psci_migrate_info_type(void) -{ - int err; - u32 fn; - - fn = psci_function_id[PSCI_FN_MIGRATE_INFO_TYPE]; - err = invoke_psci_fn(fn, 0, 0, 0); - return err; -} - -static int get_set_conduit_method(struct device_node *np) -{ - const char *method; - - pr_info("probing for conduit method from DT.\n"); - - if (of_property_read_string(np, "method", &method)) { - pr_warn("missing \"method\" property\n"); - return -ENXIO; - } - - if (!strcmp("hvc", method)) { - invoke_psci_fn = __invoke_psci_fn_hvc; - } else if (!strcmp("smc", method)) { - invoke_psci_fn = __invoke_psci_fn_smc; - } else { - pr_warn("invalid \"method\" property: %s\n", method); - return -EINVAL; - } - return 0; -} - -static void psci_sys_reset(enum reboot_mode reboot_mode, const char *cmd) -{ - invoke_psci_fn(PSCI_0_2_FN_SYSTEM_RESET, 0, 0, 0); -} - -static void psci_sys_poweroff(void) -{ - invoke_psci_fn(PSCI_0_2_FN_SYSTEM_OFF, 0, 0, 0); -} - -/* - * PSCI Function IDs for v0.2+ are well defined so use - * standard values. - */ -static int psci_0_2_init(struct device_node *np) -{ - int err, ver; - - err = get_set_conduit_method(np); - - if (err) - goto out_put_node; - - ver = psci_get_version(); - - if (ver == PSCI_RET_NOT_SUPPORTED) { - /* PSCI v0.2 mandates implementation of PSCI_ID_VERSION. */ - pr_err("PSCI firmware does not comply with the v0.2 spec.\n"); - err = -EOPNOTSUPP; - goto out_put_node; - } else { - pr_info("PSCIv%d.%d detected in firmware.\n", - PSCI_VERSION_MAJOR(ver), - PSCI_VERSION_MINOR(ver)); - - if (PSCI_VERSION_MAJOR(ver) == 0 && - PSCI_VERSION_MINOR(ver) < 2) { - err = -EINVAL; - pr_err("Conflicting PSCI version detected.\n"); - goto out_put_node; - } - } - - pr_info("Using standard PSCI v0.2 function IDs\n"); - psci_function_id[PSCI_FN_CPU_SUSPEND] = PSCI_0_2_FN_CPU_SUSPEND; - psci_ops.cpu_suspend = psci_cpu_suspend; - - psci_function_id[PSCI_FN_CPU_OFF] = PSCI_0_2_FN_CPU_OFF; - psci_ops.cpu_off = psci_cpu_off; - - psci_function_id[PSCI_FN_CPU_ON] = PSCI_0_2_FN_CPU_ON; - psci_ops.cpu_on = psci_cpu_on; - - psci_function_id[PSCI_FN_MIGRATE] = PSCI_0_2_FN_MIGRATE; - psci_ops.migrate = psci_migrate; - - psci_function_id[PSCI_FN_AFFINITY_INFO] = PSCI_0_2_FN_AFFINITY_INFO; - psci_ops.affinity_info = psci_affinity_info; - - psci_function_id[PSCI_FN_MIGRATE_INFO_TYPE] = - PSCI_0_2_FN_MIGRATE_INFO_TYPE; - psci_ops.migrate_info_type = psci_migrate_info_type; - - arm_pm_restart = psci_sys_reset; - - pm_power_off = psci_sys_poweroff; - -out_put_node: - of_node_put(np); - return err; -} - -/* - * PSCI < v0.2 get PSCI Function IDs via DT. - */ -static int psci_0_1_init(struct device_node *np) -{ - u32 id; - int err; - - err = get_set_conduit_method(np); - - if (err) - goto out_put_node; - - pr_info("Using PSCI v0.1 Function IDs from DT\n"); - - if (!of_property_read_u32(np, "cpu_suspend", &id)) { - psci_function_id[PSCI_FN_CPU_SUSPEND] = id; - psci_ops.cpu_suspend = psci_cpu_suspend; - } - - if (!of_property_read_u32(np, "cpu_off", &id)) { - psci_function_id[PSCI_FN_CPU_OFF] = id; - psci_ops.cpu_off = psci_cpu_off; - } - - if (!of_property_read_u32(np, "cpu_on", &id)) { - psci_function_id[PSCI_FN_CPU_ON] = id; - psci_ops.cpu_on = psci_cpu_on; - } - - if (!of_property_read_u32(np, "migrate", &id)) { - psci_function_id[PSCI_FN_MIGRATE] = id; - psci_ops.migrate = psci_migrate; - } - -out_put_node: - of_node_put(np); - return err; -} - -static const struct of_device_id psci_of_match[] __initconst = { - { .compatible = "arm,psci", .data = psci_0_1_init}, - { .compatible = "arm,psci-0.2", .data = psci_0_2_init}, - {}, -}; - -int __init psci_init(void) -{ - struct device_node *np; - const struct of_device_id *matched_np; - psci_initcall_t init_fn; - - np = of_find_matching_node_and_match(NULL, psci_of_match, &matched_np); - if (!np) - return -ENODEV; - - init_fn = (psci_initcall_t)matched_np->data; - return init_fn(np); -} diff --git a/arch/arm/kernel/psci_smp.c b/arch/arm/kernel/psci_smp.c index 28a1db4da704..9d479b2ea40d 100644 --- a/arch/arm/kernel/psci_smp.c +++ b/arch/arm/kernel/psci_smp.c @@ -17,6 +17,8 @@ #include <linux/smp.h> #include <linux/of.h> #include <linux/delay.h> +#include <linux/psci.h> + #include <uapi/linux/psci.h> #include <asm/psci.h> @@ -51,25 +53,37 @@ static int psci_boot_secondary(unsigned int cpu, struct task_struct *idle) { if (psci_ops.cpu_on) return psci_ops.cpu_on(cpu_logical_map(cpu), - __pa(secondary_startup)); + virt_to_idmap(&secondary_startup)); return -ENODEV; } #ifdef CONFIG_HOTPLUG_CPU -void __ref psci_cpu_die(unsigned int cpu) +int psci_cpu_disable(unsigned int cpu) +{ + /* Fail early if we don't have CPU_OFF support */ + if (!psci_ops.cpu_off) + return -EOPNOTSUPP; + + /* Trusted OS will deny CPU_OFF */ + if (psci_tos_resident_on(cpu)) + return -EPERM; + + return 0; +} + +void psci_cpu_die(unsigned int cpu) { - const struct psci_power_state ps = { - .type = PSCI_POWER_STATE_TYPE_POWER_DOWN, - }; + u32 state = PSCI_POWER_STATE_TYPE_POWER_DOWN << + PSCI_0_2_POWER_STATE_TYPE_SHIFT; - if (psci_ops.cpu_off) - psci_ops.cpu_off(ps); + if (psci_ops.cpu_off) + psci_ops.cpu_off(state); - /* We should never return */ - panic("psci: cpu %d failed to shutdown\n", cpu); + /* We should never return */ + panic("psci: cpu %d failed to shutdown\n", cpu); } -int __ref psci_cpu_kill(unsigned int cpu) +int psci_cpu_kill(unsigned int cpu) { int err, i; @@ -109,6 +123,7 @@ bool __init psci_smp_available(void) struct smp_operations __initdata psci_smp_ops = { .smp_boot_secondary = psci_boot_secondary, #ifdef CONFIG_HOTPLUG_CPU + .cpu_disable = psci_cpu_disable, .cpu_die = psci_cpu_die, .cpu_kill = psci_cpu_kill, #endif diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c index 6c777e908a24..3224680e44f4 100644 --- a/arch/arm/kernel/setup.c +++ b/arch/arm/kernel/setup.c @@ -31,6 +31,7 @@ #include <linux/bug.h> #include <linux/compiler.h> #include <linux/sort.h> +#include <linux/psci.h> #include <asm/unified.h> #include <asm/cp15.h> @@ -950,7 +951,7 @@ void __init setup_arch(char **cmdline_p) unflatten_device_tree(); arm_dt_init_cpu_maps(); - psci_init(); + psci_dt_init(); #ifdef CONFIG_SMP if (is_smp()) { if (!mdesc->smp_init || !mdesc->smp_init()) { diff --git a/arch/arm/kvm/psci.c b/arch/arm/kvm/psci.c index 531e922486b2..ad6f6424f1d1 100644 --- a/arch/arm/kvm/psci.c +++ b/arch/arm/kvm/psci.c @@ -24,6 +24,8 @@ #include <asm/kvm_psci.h> #include <asm/kvm_host.h> +#include <uapi/linux/psci.h> + /* * This is an implementation of the Power State Coordination Interface * as described in ARM document number ARM DEN 0022A. @@ -124,7 +126,7 @@ static unsigned long kvm_psci_vcpu_on(struct kvm_vcpu *source_vcpu) static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu) { - int i; + int i, matching_cpus = 0; unsigned long mpidr; unsigned long target_affinity; unsigned long target_affinity_mask; @@ -149,12 +151,16 @@ static unsigned long kvm_psci_vcpu_affinity_info(struct kvm_vcpu *vcpu) */ kvm_for_each_vcpu(i, tmp, kvm) { mpidr = kvm_vcpu_get_mpidr_aff(tmp); - if (((mpidr & target_affinity_mask) == target_affinity) && - !tmp->arch.pause) { - return PSCI_0_2_AFFINITY_LEVEL_ON; + if ((mpidr & target_affinity_mask) == target_affinity) { + matching_cpus++; + if (!tmp->arch.pause) + return PSCI_0_2_AFFINITY_LEVEL_ON; } } + if (!matching_cpus) + return PSCI_RET_INVALID_PARAMS; + return PSCI_0_2_AFFINITY_LEVEL_OFF; } diff --git a/arch/arm/mach-highbank/highbank.c b/arch/arm/mach-highbank/highbank.c index 231fba0d03e5..6050a14faee6 100644 --- a/arch/arm/mach-highbank/highbank.c +++ b/arch/arm/mach-highbank/highbank.c @@ -28,8 +28,8 @@ #include <linux/reboot.h> #include <linux/amba/bus.h> #include <linux/platform_device.h> +#include <linux/psci.h> -#include <asm/psci.h> #include <asm/hardware/cache-l2x0.h> #include <asm/mach/arch.h> #include <asm/mach/map.h> diff --git a/arch/arm/mach-highbank/pm.c b/arch/arm/mach-highbank/pm.c index 7f2bd85eb935..400311695548 100644 --- a/arch/arm/mach-highbank/pm.c +++ b/arch/arm/mach-highbank/pm.c @@ -16,19 +16,21 @@ #include <linux/cpu_pm.h> #include <linux/init.h> +#include <linux/psci.h> #include <linux/suspend.h> #include <asm/suspend.h> -#include <asm/psci.h> + +#include <uapi/linux/psci.h> + +#define HIGHBANK_SUSPEND_PARAM \ + ((0 << PSCI_0_2_POWER_STATE_ID_SHIFT) | \ + (1 << PSCI_0_2_POWER_STATE_AFFL_SHIFT) | \ + (PSCI_POWER_STATE_TYPE_POWER_DOWN << PSCI_0_2_POWER_STATE_TYPE_SHIFT)) static int highbank_suspend_finish(unsigned long val) { - const struct psci_power_state ps = { - .type = PSCI_POWER_STATE_TYPE_POWER_DOWN, - .affinity_level = 1, - }; - - return psci_ops.cpu_suspend(ps, __pa(cpu_resume)); + return psci_ops.cpu_suspend(HIGHBANK_SUSPEND_PARAM, __pa(cpu_resume)); } static int highbank_pm_enter(suspend_state_t state) diff --git a/arch/arm/mach-omap2/omap-hotplug.c b/arch/arm/mach-omap2/omap-hotplug.c index 971791fe9a3f..593fec753b28 100644 --- a/arch/arm/mach-omap2/omap-hotplug.c +++ b/arch/arm/mach-omap2/omap-hotplug.c @@ -27,7 +27,7 @@ * platform-specific code to shutdown a CPU * Called with IRQs disabled */ -void __ref omap4_cpu_die(unsigned int cpu) +void omap4_cpu_die(unsigned int cpu) { unsigned int boot_cpu = 0; void __iomem *base = omap_get_wakeupgen_base(); diff --git a/arch/arm/mach-omap2/omap-wakeupgen.c b/arch/arm/mach-omap2/omap-wakeupgen.c index 6833df45d7b1..3e059b215a1d 100644 --- a/arch/arm/mach-omap2/omap-wakeupgen.c +++ b/arch/arm/mach-omap2/omap-wakeupgen.c @@ -330,7 +330,7 @@ static int irq_cpu_hotplug_notify(struct notifier_block *self, return NOTIFY_OK; } -static struct notifier_block __refdata irq_hotplug_notifier = { +static struct notifier_block irq_hotplug_notifier = { .notifier_call = irq_cpu_hotplug_notify, }; diff --git a/arch/arm/mach-prima2/hotplug.c b/arch/arm/mach-prima2/hotplug.c index 0ab2f8bae28e..a728c78b996f 100644 --- a/arch/arm/mach-prima2/hotplug.c +++ b/arch/arm/mach-prima2/hotplug.c @@ -32,7 +32,7 @@ static inline void platform_do_lowpower(unsigned int cpu) * * Called with IRQs disabled */ -void __ref sirfsoc_cpu_die(unsigned int cpu) +void sirfsoc_cpu_die(unsigned int cpu) { platform_do_lowpower(cpu); } diff --git a/arch/arm/mach-qcom/platsmp.c b/arch/arm/mach-qcom/platsmp.c index 5cde63a64b34..9b00123a315d 100644 --- a/arch/arm/mach-qcom/platsmp.c +++ b/arch/arm/mach-qcom/platsmp.c @@ -49,7 +49,7 @@ extern void secondary_startup_arm(void); static DEFINE_SPINLOCK(boot_lock); #ifdef CONFIG_HOTPLUG_CPU -static void __ref qcom_cpu_die(unsigned int cpu) +static void qcom_cpu_die(unsigned int cpu) { wfi(); } diff --git a/arch/arm/mach-realview/hotplug.c b/arch/arm/mach-realview/hotplug.c index ac22dd41b135..968e2d1964f6 100644 --- a/arch/arm/mach-realview/hotplug.c +++ b/arch/arm/mach-realview/hotplug.c @@ -90,7 +90,7 @@ static inline void platform_do_lowpower(unsigned int cpu, int *spurious) * * Called with IRQs disabled */ -void __ref realview_cpu_die(unsigned int cpu) +void realview_cpu_die(unsigned int cpu) { int spurious = 0; diff --git a/arch/arm/mach-spear/hotplug.c b/arch/arm/mach-spear/hotplug.c index d97749c642ce..12edd1cf8a12 100644 --- a/arch/arm/mach-spear/hotplug.c +++ b/arch/arm/mach-spear/hotplug.c @@ -80,7 +80,7 @@ static inline void spear13xx_do_lowpower(unsigned int cpu, int *spurious) * * Called with IRQs disabled */ -void __ref spear13xx_cpu_die(unsigned int cpu) +void spear13xx_cpu_die(unsigned int cpu) { int spurious = 0; diff --git a/arch/arm/mach-tegra/hotplug.c b/arch/arm/mach-tegra/hotplug.c index 6fc71f1534b0..1b129899a277 100644 --- a/arch/arm/mach-tegra/hotplug.c +++ b/arch/arm/mach-tegra/hotplug.c @@ -37,7 +37,7 @@ int tegra_cpu_kill(unsigned cpu) * * Called with IRQs disabled */ -void __ref tegra_cpu_die(unsigned int cpu) +void tegra_cpu_die(unsigned int cpu) { if (!tegra_hotplug_shutdown) { WARN(1, "hotplug is not yet initialized\n"); diff --git a/arch/arm/mach-ux500/hotplug.c b/arch/arm/mach-ux500/hotplug.c index 2bc00b085e38..1cbed0331fd3 100644 --- a/arch/arm/mach-ux500/hotplug.c +++ b/arch/arm/mach-ux500/hotplug.c @@ -21,7 +21,7 @@ * * Called with IRQs disabled */ -void __ref ux500_cpu_die(unsigned int cpu) +void ux500_cpu_die(unsigned int cpu) { /* directly enter low power state, skipping secure registers */ for (;;) { diff --git a/arch/arm/mach-vexpress/hotplug.c b/arch/arm/mach-vexpress/hotplug.c index f0ce6b8f5e71..f2fafc10a91d 100644 --- a/arch/arm/mach-vexpress/hotplug.c +++ b/arch/arm/mach-vexpress/hotplug.c @@ -85,7 +85,7 @@ static inline void platform_do_lowpower(unsigned int cpu, int *spurious) * * Called with IRQs disabled */ -void __ref vexpress_cpu_die(unsigned int cpu) +void vexpress_cpu_die(unsigned int cpu) { int spurious = 0; diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c index be92fa0f2f35..8d32f7e9695f 100644 --- a/arch/arm/mm/init.c +++ b/arch/arm/mm/init.c @@ -22,6 +22,7 @@ #include <linux/memblock.h> #include <linux/dma-contiguous.h> #include <linux/sizes.h> +#include <linux/stop_machine.h> #include <asm/cp15.h> #include <asm/mach-types.h> @@ -626,12 +627,10 @@ static struct section_perm ro_perms[] = { * safe to be called with preemption disabled, as under stop_machine(). */ static inline void section_update(unsigned long addr, pmdval_t mask, - pmdval_t prot) + pmdval_t prot, struct mm_struct *mm) { - struct mm_struct *mm; pmd_t *pmd; - mm = current->active_mm; pmd = pmd_offset(pud_offset(pgd_offset(mm, addr), addr), addr); #ifdef CONFIG_ARM_LPAE @@ -655,49 +654,82 @@ static inline bool arch_has_strict_perms(void) return !!(get_cr() & CR_XP); } -#define set_section_perms(perms, field) { \ - size_t i; \ - unsigned long addr; \ - \ - if (!arch_has_strict_perms()) \ - return; \ - \ - for (i = 0; i < ARRAY_SIZE(perms); i++) { \ - if (!IS_ALIGNED(perms[i].start, SECTION_SIZE) || \ - !IS_ALIGNED(perms[i].end, SECTION_SIZE)) { \ - pr_err("BUG: section %lx-%lx not aligned to %lx\n", \ - perms[i].start, perms[i].end, \ - SECTION_SIZE); \ - continue; \ - } \ - \ - for (addr = perms[i].start; \ - addr < perms[i].end; \ - addr += SECTION_SIZE) \ - section_update(addr, perms[i].mask, \ - perms[i].field); \ - } \ +void set_section_perms(struct section_perm *perms, int n, bool set, + struct mm_struct *mm) +{ + size_t i; + unsigned long addr; + + if (!arch_has_strict_perms()) + return; + + for (i = 0; i < n; i++) { + if (!IS_ALIGNED(perms[i].start, SECTION_SIZE) || + !IS_ALIGNED(perms[i].end, SECTION_SIZE)) { + pr_err("BUG: section %lx-%lx not aligned to %lx\n", + perms[i].start, perms[i].end, + SECTION_SIZE); + continue; + } + + for (addr = perms[i].start; + addr < perms[i].end; + addr += SECTION_SIZE) + section_update(addr, perms[i].mask, + set ? perms[i].prot : perms[i].clear, mm); + } + } -static inline void fix_kernmem_perms(void) +static void update_sections_early(struct section_perm perms[], int n) { - set_section_perms(nx_perms, prot); + struct task_struct *t, *s; + + read_lock(&tasklist_lock); + for_each_process(t) { + if (t->flags & PF_KTHREAD) + continue; + for_each_thread(t, s) + set_section_perms(perms, n, true, s->mm); + } + read_unlock(&tasklist_lock); + set_section_perms(perms, n, true, current->active_mm); + set_section_perms(perms, n, true, &init_mm); +} + +int __fix_kernmem_perms(void *unused) +{ + update_sections_early(nx_perms, ARRAY_SIZE(nx_perms)); + return 0; +} + +void fix_kernmem_perms(void) +{ + stop_machine(__fix_kernmem_perms, NULL, NULL); } #ifdef CONFIG_DEBUG_RODATA +int __mark_rodata_ro(void *unused) +{ + update_sections_early(ro_perms, ARRAY_SIZE(ro_perms)); + return 0; +} + void mark_rodata_ro(void) { - set_section_perms(ro_perms, prot); + stop_machine(__mark_rodata_ro, NULL, NULL); } void set_kernel_text_rw(void) { - set_section_perms(ro_perms, clear); + set_section_perms(ro_perms, ARRAY_SIZE(ro_perms), false, + current->active_mm); } void set_kernel_text_ro(void) { - set_section_perms(ro_perms, prot); + set_section_perms(ro_perms, ARRAY_SIZE(ro_perms), true, + current->active_mm); } #endif /* CONFIG_DEBUG_RODATA */ diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 6f0a3b41b009..8944f2ab8fd7 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -19,6 +19,7 @@ config ARM64 select ARM_GIC_V2M if PCI_MSI select ARM_GIC_V3 select ARM_GIC_V3_ITS if PCI_MSI + select ARM_PSCI_FW select BUILDTIME_EXTABLE_SORT select CLONE_BACKWARDS select COMMON_CLK @@ -26,7 +27,7 @@ config ARM64 select DCACHE_WORD_ACCESS select GENERIC_ALLOCATOR select GENERIC_CLOCKEVENTS - select GENERIC_CLOCKEVENTS_BROADCAST if SMP + select GENERIC_CLOCKEVENTS_BROADCAST select GENERIC_CPU_AUTOPROBE select GENERIC_EARLY_IOREMAP select GENERIC_IRQ_PROBE @@ -138,6 +139,9 @@ config NEED_DMA_MAP_STATE config NEED_SG_DMA_LENGTH def_bool y +config SMP + def_bool y + config SWIOTLB def_bool y @@ -486,22 +490,8 @@ config CPU_BIG_ENDIAN help Say Y if you plan on running a kernel in big-endian mode. -config SMP - bool "Symmetric Multi-Processing" - help - This enables support for systems with more than one CPU. If - you say N here, the kernel will run on single and - multiprocessor machines, but will use only one CPU of a - multiprocessor machine. If you say Y here, the kernel will run - on many, but not all, single processor machines. On a single - processor machine, the kernel will run faster if you say N - here. - - If you don't know what to do here, say N. - config SCHED_MC bool "Multi-core scheduler support" - depends on SMP help Multi-core scheduler support improves the CPU scheduler's decision making when dealing with multi-core CPU chips at a cost of slightly @@ -509,7 +499,6 @@ config SCHED_MC config SCHED_SMT bool "SMT scheduler support" - depends on SMP help Improves the CPU scheduler's decision making when dealing with MultiThreading at a cost of slightly increased overhead in some @@ -518,23 +507,17 @@ config SCHED_SMT config NR_CPUS int "Maximum number of CPUs (2-4096)" range 2 4096 - depends on SMP # These have to remain sorted largest to smallest default "64" config HOTPLUG_CPU bool "Support for hot-pluggable CPUs" - depends on SMP help Say Y here to experiment with turning CPUs off and on. CPUs can be controlled through /sys/devices/system/cpu. source kernel/Kconfig.preempt -config UP_LATE_INIT - def_bool y - depends on !SMP - config HZ int default 100 @@ -609,6 +592,20 @@ config FORCE_MAX_ZONEORDER default "14" if (ARM64_64K_PAGES && TRANSPARENT_HUGEPAGE) default "11" +config ARM64_PAN + bool "Enable support for Privileged Access Never (PAN)" + default y + help + Privileged Access Never (PAN; part of the ARMv8.1 Extensions) + prevents the kernel or hypervisor from accessing user-space (EL0) + memory directly. + + Choosing this option will cause any unprotected (not using + copy_to_user et al) memory access to fail with a permission fault. + + The feature is detected at runtime, and will remain as a 'nop' + instruction if the cpu does not implement the feature. + menuconfig ARMV8_DEPRECATED bool "Emulate deprecated/obsolete ARMv8 instructions" depends on COMPAT diff --git a/arch/arm64/boot/dts/apm/apm-storm.dtsi b/arch/arm64/boot/dts/apm/apm-storm.dtsi index c8d3e0e86678..d8f3a1c65ecd 100644 --- a/arch/arm64/boot/dts/apm/apm-storm.dtsi +++ b/arch/arm64/boot/dts/apm/apm-storm.dtsi @@ -374,6 +374,28 @@ }; }; + msi: msi@79000000 { + compatible = "apm,xgene1-msi"; + msi-controller; + reg = <0x00 0x79000000 0x0 0x900000>; + interrupts = < 0x0 0x10 0x4 + 0x0 0x11 0x4 + 0x0 0x12 0x4 + 0x0 0x13 0x4 + 0x0 0x14 0x4 + 0x0 0x15 0x4 + 0x0 0x16 0x4 + 0x0 0x17 0x4 + 0x0 0x18 0x4 + 0x0 0x19 0x4 + 0x0 0x1a 0x4 + 0x0 0x1b 0x4 + 0x0 0x1c 0x4 + 0x0 0x1d 0x4 + 0x0 0x1e 0x4 + 0x0 0x1f 0x4>; + }; + pcie0: pcie@1f2b0000 { status = "disabled"; device_type = "pci"; @@ -395,6 +417,7 @@ 0x0 0x0 0x0 0x4 &gic 0x0 0xc5 0x1>; dma-coherent; clocks = <&pcie0clk 0>; + msi-parent = <&msi>; }; pcie1: pcie@1f2c0000 { @@ -418,6 +441,7 @@ 0x0 0x0 0x0 0x4 &gic 0x0 0xcb 0x1>; dma-coherent; clocks = <&pcie1clk 0>; + msi-parent = <&msi>; }; pcie2: pcie@1f2d0000 { @@ -441,6 +465,7 @@ 0x0 0x0 0x0 0x4 &gic 0x0 0xd1 0x1>; dma-coherent; clocks = <&pcie2clk 0>; + msi-parent = <&msi>; }; pcie3: pcie@1f500000 { @@ -464,6 +489,7 @@ 0x0 0x0 0x0 0x4 &gic 0x0 0xd7 0x1>; dma-coherent; clocks = <&pcie3clk 0>; + msi-parent = <&msi>; }; pcie4: pcie@1f510000 { @@ -487,6 +513,7 @@ 0x0 0x0 0x0 0x4 &gic 0x0 0xdd 0x1>; dma-coherent; clocks = <&pcie4clk 0>; + msi-parent = <&msi>; }; serial0: serial@1c020000 { diff --git a/arch/arm64/boot/dts/arm/Makefile b/arch/arm64/boot/dts/arm/Makefile index 301a0dada1fe..c5c98b91514e 100644 --- a/arch/arm64/boot/dts/arm/Makefile +++ b/arch/arm64/boot/dts/arm/Makefile @@ -1,5 +1,5 @@ dtb-$(CONFIG_ARCH_VEXPRESS) += foundation-v8.dtb -dtb-$(CONFIG_ARCH_VEXPRESS) += juno.dtb +dtb-$(CONFIG_ARCH_VEXPRESS) += juno.dtb juno-r1.dtb dtb-$(CONFIG_ARCH_VEXPRESS) += rtsm_ve-aemv8a.dtb always := $(dtb-y) diff --git a/arch/arm64/boot/dts/arm/juno-base.dtsi b/arch/arm64/boot/dts/arm/juno-base.dtsi new file mode 100644 index 000000000000..e3ee96036eca --- /dev/null +++ b/arch/arm64/boot/dts/arm/juno-base.dtsi @@ -0,0 +1,154 @@ + /* + * Devices shared by all Juno boards + */ + + memtimer: timer@2a810000 { + compatible = "arm,armv7-timer-mem"; + reg = <0x0 0x2a810000 0x0 0x10000>; + clock-frequency = <50000000>; + #address-cells = <2>; + #size-cells = <2>; + ranges; + status = "disabled"; + frame@2a830000 { + frame-number = <1>; + interrupts = <0 60 4>; + reg = <0x0 0x2a830000 0x0 0x10000>; + }; + }; + + gic: interrupt-controller@2c010000 { + compatible = "arm,gic-400", "arm,cortex-a15-gic"; + reg = <0x0 0x2c010000 0 0x1000>, + <0x0 0x2c02f000 0 0x2000>, + <0x0 0x2c04f000 0 0x2000>, + <0x0 0x2c06f000 0 0x2000>; + #address-cells = <2>; + #interrupt-cells = <3>; + #size-cells = <2>; + interrupt-controller; + interrupts = <GIC_PPI 9 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_HIGH)>; + ranges = <0 0 0 0x2c1c0000 0 0x40000>; + v2m_0: v2m@0 { + compatible = "arm,gic-v2m-frame"; + msi-controller; + reg = <0 0 0 0x1000>; + }; + }; + + timer { + compatible = "arm,armv8-timer"; + interrupts = <GIC_PPI 13 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_LOW)>, + <GIC_PPI 14 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_LOW)>, + <GIC_PPI 11 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_LOW)>, + <GIC_PPI 10 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_LOW)>; + }; + + /include/ "juno-clocks.dtsi" + + dma@7ff00000 { + compatible = "arm,pl330", "arm,primecell"; + reg = <0x0 0x7ff00000 0 0x1000>; + #dma-cells = <1>; + #dma-channels = <8>; + #dma-requests = <32>; + interrupts = <GIC_SPI 88 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 89 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 90 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 91 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 108 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 109 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 110 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 111 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&soc_faxiclk>; + clock-names = "apb_pclk"; + }; + + soc_uart0: uart@7ff80000 { + compatible = "arm,pl011", "arm,primecell"; + reg = <0x0 0x7ff80000 0x0 0x1000>; + interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&soc_uartclk>, <&soc_refclk100mhz>; + clock-names = "uartclk", "apb_pclk"; + }; + + i2c@7ffa0000 { + compatible = "snps,designware-i2c"; + reg = <0x0 0x7ffa0000 0x0 0x1000>; + #address-cells = <1>; + #size-cells = <0>; + interrupts = <GIC_SPI 104 IRQ_TYPE_LEVEL_HIGH>; + clock-frequency = <400000>; + i2c-sda-hold-time-ns = <500>; + clocks = <&soc_smc50mhz>; + + dvi0: dvi-transmitter@70 { + compatible = "nxp,tda998x"; + reg = <0x70>; + }; + + dvi1: dvi-transmitter@71 { + compatible = "nxp,tda998x"; + reg = <0x71>; + }; + }; + + ohci@7ffb0000 { + compatible = "generic-ohci"; + reg = <0x0 0x7ffb0000 0x0 0x10000>; + interrupts = <GIC_SPI 116 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&soc_usb48mhz>; + }; + + ehci@7ffc0000 { + compatible = "generic-ehci"; + reg = <0x0 0x7ffc0000 0x0 0x10000>; + interrupts = <GIC_SPI 117 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&soc_usb48mhz>; + }; + + memory-controller@7ffd0000 { + compatible = "arm,pl354", "arm,primecell"; + reg = <0 0x7ffd0000 0 0x1000>; + interrupts = <GIC_SPI 86 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 87 IRQ_TYPE_LEVEL_HIGH>; + clocks = <&soc_smc50mhz>; + clock-names = "apb_pclk"; + }; + + memory@80000000 { + device_type = "memory"; + /* last 16MB of the first memory area is reserved for secure world use by firmware */ + reg = <0x00000000 0x80000000 0x0 0x7f000000>, + <0x00000008 0x80000000 0x1 0x80000000>; + }; + + smb { + compatible = "simple-bus"; + #address-cells = <2>; + #size-cells = <1>; + ranges = <0 0 0 0x08000000 0x04000000>, + <1 0 0 0x14000000 0x04000000>, + <2 0 0 0x18000000 0x04000000>, + <3 0 0 0x1c000000 0x04000000>, + <4 0 0 0x0c000000 0x04000000>, + <5 0 0 0x10000000 0x04000000>; + + #interrupt-cells = <1>; + interrupt-map-mask = <0 0 15>; + interrupt-map = <0 0 0 &gic 0 0 0 68 IRQ_TYPE_LEVEL_HIGH>, + <0 0 1 &gic 0 0 0 69 IRQ_TYPE_LEVEL_HIGH>, + <0 0 2 &gic 0 0 0 70 IRQ_TYPE_LEVEL_HIGH>, + <0 0 3 &gic 0 0 0 160 IRQ_TYPE_LEVEL_HIGH>, + <0 0 4 &gic 0 0 0 161 IRQ_TYPE_LEVEL_HIGH>, + <0 0 5 &gic 0 0 0 162 IRQ_TYPE_LEVEL_HIGH>, + <0 0 6 &gic 0 0 0 163 IRQ_TYPE_LEVEL_HIGH>, + <0 0 7 &gic 0 0 0 164 IRQ_TYPE_LEVEL_HIGH>, + <0 0 8 &gic 0 0 0 165 IRQ_TYPE_LEVEL_HIGH>, + <0 0 9 &gic 0 0 0 166 IRQ_TYPE_LEVEL_HIGH>, + <0 0 10 &gic 0 0 0 167 IRQ_TYPE_LEVEL_HIGH>, + <0 0 11 &gic 0 0 0 168 IRQ_TYPE_LEVEL_HIGH>, + <0 0 12 &gic 0 0 0 169 IRQ_TYPE_LEVEL_HIGH>; + + /include/ "juno-motherboard.dtsi" + }; diff --git a/arch/arm64/boot/dts/arm/juno-clocks.dtsi b/arch/arm64/boot/dts/arm/juno-clocks.dtsi index c9b89efe0f56..25352ed943e6 100644 --- a/arch/arm64/boot/dts/arm/juno-clocks.dtsi +++ b/arch/arm64/boot/dts/arm/juno-clocks.dtsi @@ -36,9 +36,9 @@ clock-output-names = "apb_pclk"; }; - soc_faxiclk: refclk533mhz { + soc_faxiclk: refclk400mhz { compatible = "fixed-clock"; #clock-cells = <0>; - clock-frequency = <533000000>; + clock-frequency = <400000000>; clock-output-names = "faxi_clk"; }; diff --git a/arch/arm64/boot/dts/arm/juno-r1.dts b/arch/arm64/boot/dts/arm/juno-r1.dts new file mode 100644 index 000000000000..a25964d26bda --- /dev/null +++ b/arch/arm64/boot/dts/arm/juno-r1.dts @@ -0,0 +1,136 @@ +/* + * ARM Ltd. Juno Platform + * + * Copyright (c) 2015 ARM Ltd. + * + * This file is licensed under a dual GPLv2 or BSD license. + */ + +/dts-v1/; + +#include <dt-bindings/interrupt-controller/arm-gic.h> + +/ { + model = "ARM Juno development board (r1)"; + compatible = "arm,juno-r1", "arm,juno", "arm,vexpress"; + interrupt-parent = <&gic>; + #address-cells = <2>; + #size-cells = <2>; + + aliases { + serial0 = &soc_uart0; + }; + + chosen { + stdout-path = "serial0:115200n8"; + }; + + psci { + compatible = "arm,psci-0.2"; + method = "smc"; + }; + + cpus { + #address-cells = <2>; + #size-cells = <0>; + + A57_0: cpu@0 { + compatible = "arm,cortex-a57","arm,armv8"; + reg = <0x0 0x0>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A57_L2>; + }; + + A57_1: cpu@1 { + compatible = "arm,cortex-a57","arm,armv8"; + reg = <0x0 0x1>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A57_L2>; + }; + + A53_0: cpu@100 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x100>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + }; + + A53_1: cpu@101 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x101>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + }; + + A53_2: cpu@102 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x102>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + }; + + A53_3: cpu@103 { + compatible = "arm,cortex-a53","arm,armv8"; + reg = <0x0 0x103>; + device_type = "cpu"; + enable-method = "psci"; + next-level-cache = <&A53_L2>; + }; + + A57_L2: l2-cache0 { + compatible = "cache"; + }; + + A53_L2: l2-cache1 { + compatible = "cache"; + }; + }; + + pmu { + compatible = "arm,armv8-pmuv3"; + interrupts = <GIC_SPI 02 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 06 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 18 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 22 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 26 IRQ_TYPE_LEVEL_HIGH>, + <GIC_SPI 30 IRQ_TYPE_LEVEL_HIGH>; + interrupt-affinity = <&A57_0>, + <&A57_1>, + <&A53_0>, + <&A53_1>, + <&A53_2>, + <&A53_3>; + }; + + #include "juno-base.dtsi" + + pcie-controller@40000000 { + compatible = "arm,juno-r1-pcie", "plda,xpressrich3-axi", "pci-host-ecam-generic"; + device_type = "pci"; + reg = <0 0x40000000 0 0x10000000>; /* ECAM config space */ + bus-range = <0 255>; + linux,pci-domain = <0>; + #address-cells = <3>; + #size-cells = <2>; + dma-coherent; + ranges = <0x01000000 0x00 0x5f800000 0x00 0x5f800000 0x0 0x00800000>, + <0x02000000 0x00 0x50000000 0x00 0x50000000 0x0 0x08000000>, + <0x42000000 0x40 0x00000000 0x40 0x00000000 0x1 0x00000000>; + #interrupt-cells = <1>; + interrupt-map-mask = <0 0 0 7>; + interrupt-map = <0 0 0 1 &gic 0 0 0 136 4>, + <0 0 0 2 &gic 0 0 0 137 4>, + <0 0 0 3 &gic 0 0 0 138 4>, + <0 0 0 4 &gic 0 0 0 139 4>; + msi-parent = <&v2m_0>; + }; +}; + +&memtimer { + status = "okay"; +}; diff --git a/arch/arm64/boot/dts/arm/juno.dts b/arch/arm64/boot/dts/arm/juno.dts index 5e9110a3353d..d7cbdd482a61 100644 --- a/arch/arm64/boot/dts/arm/juno.dts +++ b/arch/arm64/boot/dts/arm/juno.dts @@ -91,33 +91,6 @@ }; }; - memory@80000000 { - device_type = "memory"; - /* last 16MB of the first memory area is reserved for secure world use by firmware */ - reg = <0x00000000 0x80000000 0x0 0x7f000000>, - <0x00000008 0x80000000 0x1 0x80000000>; - }; - - gic: interrupt-controller@2c001000 { - compatible = "arm,gic-400", "arm,cortex-a15-gic"; - reg = <0x0 0x2c010000 0 0x1000>, - <0x0 0x2c02f000 0 0x2000>, - <0x0 0x2c04f000 0 0x2000>, - <0x0 0x2c06f000 0 0x2000>; - #address-cells = <0>; - #interrupt-cells = <3>; - interrupt-controller; - interrupts = <GIC_PPI 9 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_HIGH)>; - }; - - timer { - compatible = "arm,armv8-timer"; - interrupts = <GIC_PPI 13 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_LOW)>, - <GIC_PPI 14 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_LOW)>, - <GIC_PPI 11 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_LOW)>, - <GIC_PPI 10 (GIC_CPU_MASK_SIMPLE(6) | IRQ_TYPE_LEVEL_LOW)>; - }; - pmu { compatible = "arm,armv8-pmuv3"; interrupts = <GIC_SPI 02 IRQ_TYPE_LEVEL_HIGH>, @@ -134,105 +107,5 @@ <&A53_3>; }; - /include/ "juno-clocks.dtsi" - - dma@7ff00000 { - compatible = "arm,pl330", "arm,primecell"; - reg = <0x0 0x7ff00000 0 0x1000>; - #dma-cells = <1>; - #dma-channels = <8>; - #dma-requests = <32>; - interrupts = <GIC_SPI 88 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 89 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 90 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 91 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 108 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 109 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 110 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 111 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&soc_faxiclk>; - clock-names = "apb_pclk"; - }; - - soc_uart0: uart@7ff80000 { - compatible = "arm,pl011", "arm,primecell"; - reg = <0x0 0x7ff80000 0x0 0x1000>; - interrupts = <GIC_SPI 83 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&soc_uartclk>, <&soc_refclk100mhz>; - clock-names = "uartclk", "apb_pclk"; - }; - - i2c@7ffa0000 { - compatible = "snps,designware-i2c"; - reg = <0x0 0x7ffa0000 0x0 0x1000>; - #address-cells = <1>; - #size-cells = <0>; - interrupts = <GIC_SPI 104 IRQ_TYPE_LEVEL_HIGH>; - clock-frequency = <400000>; - i2c-sda-hold-time-ns = <500>; - clocks = <&soc_smc50mhz>; - - dvi0: dvi-transmitter@70 { - compatible = "nxp,tda998x"; - reg = <0x70>; - }; - - dvi1: dvi-transmitter@71 { - compatible = "nxp,tda998x"; - reg = <0x71>; - }; - }; - - ohci@7ffb0000 { - compatible = "generic-ohci"; - reg = <0x0 0x7ffb0000 0x0 0x10000>; - interrupts = <GIC_SPI 116 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&soc_usb48mhz>; - }; - - ehci@7ffc0000 { - compatible = "generic-ehci"; - reg = <0x0 0x7ffc0000 0x0 0x10000>; - interrupts = <GIC_SPI 117 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&soc_usb48mhz>; - }; - - memory-controller@7ffd0000 { - compatible = "arm,pl354", "arm,primecell"; - reg = <0 0x7ffd0000 0 0x1000>; - interrupts = <GIC_SPI 86 IRQ_TYPE_LEVEL_HIGH>, - <GIC_SPI 87 IRQ_TYPE_LEVEL_HIGH>; - clocks = <&soc_smc50mhz>; - clock-names = "apb_pclk"; - }; - - smb { - compatible = "simple-bus"; - #address-cells = <2>; - #size-cells = <1>; - ranges = <0 0 0 0x08000000 0x04000000>, - <1 0 0 0x14000000 0x04000000>, - <2 0 0 0x18000000 0x04000000>, - <3 0 0 0x1c000000 0x04000000>, - <4 0 0 0x0c000000 0x04000000>, - <5 0 0 0x10000000 0x04000000>; - - #interrupt-cells = <1>; - interrupt-map-mask = <0 0 15>; - interrupt-map = <0 0 0 &gic 0 68 IRQ_TYPE_LEVEL_HIGH>, - <0 0 1 &gic 0 69 IRQ_TYPE_LEVEL_HIGH>, - <0 0 2 &gic 0 70 IRQ_TYPE_LEVEL_HIGH>, - <0 0 3 &gic 0 160 IRQ_TYPE_LEVEL_HIGH>, - <0 0 4 &gic 0 161 IRQ_TYPE_LEVEL_HIGH>, - <0 0 5 &gic 0 162 IRQ_TYPE_LEVEL_HIGH>, - <0 0 6 &gic 0 163 IRQ_TYPE_LEVEL_HIGH>, - <0 0 7 &gic 0 164 IRQ_TYPE_LEVEL_HIGH>, - <0 0 8 &gic 0 165 IRQ_TYPE_LEVEL_HIGH>, - <0 0 9 &gic 0 166 IRQ_TYPE_LEVEL_HIGH>, - <0 0 10 &gic 0 167 IRQ_TYPE_LEVEL_HIGH>, - <0 0 11 &gic 0 168 IRQ_TYPE_LEVEL_HIGH>, - <0 0 12 &gic 0 169 IRQ_TYPE_LEVEL_HIGH>; - - /include/ "juno-motherboard.dtsi" - }; + #include "juno-base.dtsi" }; diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig index 2ed7449d9273..e9e4ea5cd339 100644 --- a/arch/arm64/configs/defconfig +++ b/arch/arm64/configs/defconfig @@ -45,6 +45,7 @@ CONFIG_ARCH_XGENE=y CONFIG_ARCH_ZYNQMP=y CONFIG_PCI=y CONFIG_PCI_MSI=y +CONFIG_PCI_HOST_GENERIC=y CONFIG_PCI_XGENE=y CONFIG_SMP=y CONFIG_PREEMPT=y diff --git a/arch/arm64/include/asm/acpi.h b/arch/arm64/include/asm/acpi.h index 59c05d8ea4a0..ed7e212c893c 100644 --- a/arch/arm64/include/asm/acpi.h +++ b/arch/arm64/include/asm/acpi.h @@ -12,8 +12,9 @@ #ifndef _ASM_ACPI_H #define _ASM_ACPI_H -#include <linux/mm.h> #include <linux/irqchip/arm-gic-acpi.h> +#include <linux/mm.h> +#include <linux/psci.h> #include <asm/cputype.h> #include <asm/smp_plat.h> @@ -39,18 +40,6 @@ extern int acpi_disabled; extern int acpi_noirq; extern int acpi_pci_disabled; -/* 1 to indicate PSCI 0.2+ is implemented */ -static inline bool acpi_psci_present(void) -{ - return acpi_gbl_FADT.arm_boot_flags & ACPI_FADT_PSCI_COMPLIANT; -} - -/* 1 to indicate HVC must be used instead of SMC as the PSCI conduit */ -static inline bool acpi_psci_use_hvc(void) -{ - return acpi_gbl_FADT.arm_boot_flags & ACPI_FADT_PSCI_USE_HVC; -} - static inline void disable_acpi(void) { acpi_disabled = 1; @@ -88,9 +77,11 @@ static inline void arch_fix_phys_package_id(int num, u32 slot) { } void __init acpi_init_cpus(void); #else -static inline bool acpi_psci_present(void) { return false; } -static inline bool acpi_psci_use_hvc(void) { return false; } static inline void acpi_init_cpus(void) { } #endif /* CONFIG_ACPI */ +static inline const char *acpi_get_enable_method(int cpu) +{ + return acpi_psci_present() ? "psci" : NULL; +} #endif /*_ASM_ACPI_H*/ diff --git a/arch/arm64/include/asm/alternative-asm.h b/arch/arm64/include/asm/alternative-asm.h deleted file mode 100644 index 919a67855b63..000000000000 --- a/arch/arm64/include/asm/alternative-asm.h +++ /dev/null @@ -1,29 +0,0 @@ -#ifndef __ASM_ALTERNATIVE_ASM_H -#define __ASM_ALTERNATIVE_ASM_H - -#ifdef __ASSEMBLY__ - -.macro altinstruction_entry orig_offset alt_offset feature orig_len alt_len - .word \orig_offset - . - .word \alt_offset - . - .hword \feature - .byte \orig_len - .byte \alt_len -.endm - -.macro alternative_insn insn1 insn2 cap -661: \insn1 -662: .pushsection .altinstructions, "a" - altinstruction_entry 661b, 663f, \cap, 662b-661b, 664f-663f - .popsection - .pushsection .altinstr_replacement, "ax" -663: \insn2 -664: .popsection - .if ((664b-663b) != (662b-661b)) - .error "Alternatives instruction length mismatch" - .endif -.endm - -#endif /* __ASSEMBLY__ */ - -#endif /* __ASM_ALTERNATIVE_ASM_H */ diff --git a/arch/arm64/include/asm/alternative.h b/arch/arm64/include/asm/alternative.h index d261f01e2bae..20367882226c 100644 --- a/arch/arm64/include/asm/alternative.h +++ b/arch/arm64/include/asm/alternative.h @@ -1,6 +1,9 @@ #ifndef __ASM_ALTERNATIVE_H #define __ASM_ALTERNATIVE_H +#ifndef __ASSEMBLY__ + +#include <linux/kconfig.h> #include <linux/types.h> #include <linux/stddef.h> #include <linux/stringify.h> @@ -24,8 +27,22 @@ void free_alternatives_memory(void); " .byte 662b-661b\n" /* source len */ \ " .byte 664f-663f\n" /* replacement len */ -/* alternative assembly primitive: */ -#define ALTERNATIVE(oldinstr, newinstr, feature) \ +/* + * alternative assembly primitive: + * + * If any of these .org directive fail, it means that insn1 and insn2 + * don't have the same length. This used to be written as + * + * .if ((664b-663b) != (662b-661b)) + * .error "Alternatives instruction length mismatch" + * .endif + * + * but most assemblers die if insn1 or insn2 have a .inst. This should + * be fixed in a binutils release posterior to 2.25.51.0.2 (anything + * containing commit 4e4d08cf7399b606 or c1baaddf8861). + */ +#define __ALTERNATIVE_CFG(oldinstr, newinstr, feature, cfg_enabled) \ + ".if "__stringify(cfg_enabled)" == 1\n" \ "661:\n\t" \ oldinstr "\n" \ "662:\n" \ @@ -37,8 +54,92 @@ void free_alternatives_memory(void); newinstr "\n" \ "664:\n\t" \ ".popsection\n\t" \ - ".if ((664b-663b) != (662b-661b))\n\t" \ - " .error \"Alternatives instruction length mismatch\"\n\t"\ + ".org . - (664b-663b) + (662b-661b)\n\t" \ + ".org . - (662b-661b) + (664b-663b)\n" \ ".endif\n" +#define _ALTERNATIVE_CFG(oldinstr, newinstr, feature, cfg, ...) \ + __ALTERNATIVE_CFG(oldinstr, newinstr, feature, IS_ENABLED(cfg)) + +#else + +.macro altinstruction_entry orig_offset alt_offset feature orig_len alt_len + .word \orig_offset - . + .word \alt_offset - . + .hword \feature + .byte \orig_len + .byte \alt_len +.endm + +.macro alternative_insn insn1, insn2, cap, enable = 1 + .if \enable +661: \insn1 +662: .pushsection .altinstructions, "a" + altinstruction_entry 661b, 663f, \cap, 662b-661b, 664f-663f + .popsection + .pushsection .altinstr_replacement, "ax" +663: \insn2 +664: .popsection + .org . - (664b-663b) + (662b-661b) + .org . - (662b-661b) + (664b-663b) + .endif +.endm + +/* + * Begin an alternative code sequence. + * + * The code that follows this macro will be assembled and linked as + * normal. There are no restrictions on this code. + */ +.macro alternative_if_not cap + .pushsection .altinstructions, "a" + altinstruction_entry 661f, 663f, \cap, 662f-661f, 664f-663f + .popsection +661: +.endm + +/* + * Provide the alternative code sequence. + * + * The code that follows this macro is assembled into a special + * section to be used for dynamic patching. Code that follows this + * macro must: + * + * 1. Be exactly the same length (in bytes) as the default code + * sequence. + * + * 2. Not contain a branch target that is used outside of the + * alternative sequence it is defined in (branches into an + * alternative sequence are not fixed up). + */ +.macro alternative_else +662: .pushsection .altinstr_replacement, "ax" +663: +.endm + +/* + * Complete an alternative code sequence. + */ +.macro alternative_endif +664: .popsection + .org . - (664b-663b) + (662b-661b) + .org . - (662b-661b) + (664b-663b) +.endm + +#define _ALTERNATIVE_CFG(insn1, insn2, cap, cfg, ...) \ + alternative_insn insn1, insn2, cap, IS_ENABLED(cfg) + + +#endif /* __ASSEMBLY__ */ + +/* + * Usage: asm(ALTERNATIVE(oldinstr, newinstr, feature)); + * + * Usage: asm(ALTERNATIVE(oldinstr, newinstr, feature, CONFIG_FOO)); + * N.B. If CONFIG_FOO is specified, but not selected, the whole block + * will be omitted, including oldinstr. + */ +#define ALTERNATIVE(oldinstr, newinstr, ...) \ + _ALTERNATIVE_CFG(oldinstr, newinstr, __VA_ARGS__, 1) + #endif /* __ASM_ALTERNATIVE_H */ diff --git a/arch/arm64/include/asm/assembler.h b/arch/arm64/include/asm/assembler.h index 144b64ad96c3..706ceaf2577a 100644 --- a/arch/arm64/include/asm/assembler.h +++ b/arch/arm64/include/asm/assembler.h @@ -103,9 +103,7 @@ * SMP data memory barrier */ .macro smp_dmb, opt -#ifdef CONFIG_SMP dmb \opt -#endif .endm #define USER(l, x...) \ diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h index 71f19c4dc0de..2666c9b84ca0 100644 --- a/arch/arm64/include/asm/barrier.h +++ b/arch/arm64/include/asm/barrier.h @@ -35,28 +35,6 @@ #define dma_rmb() dmb(oshld) #define dma_wmb() dmb(oshst) -#ifndef CONFIG_SMP -#define smp_mb() barrier() -#define smp_rmb() barrier() -#define smp_wmb() barrier() - -#define smp_store_release(p, v) \ -do { \ - compiletime_assert_atomic_type(*p); \ - barrier(); \ - ACCESS_ONCE(*p) = (v); \ -} while (0) - -#define smp_load_acquire(p) \ -({ \ - typeof(*p) ___p1 = ACCESS_ONCE(*p); \ - compiletime_assert_atomic_type(*p); \ - barrier(); \ - ___p1; \ -}) - -#else - #define smp_mb() dmb(ish) #define smp_rmb() dmb(ishld) #define smp_wmb() dmb(ishst) @@ -109,8 +87,6 @@ do { \ ___p1; \ }) -#endif - #define read_barrier_depends() do { } while(0) #define smp_read_barrier_depends() do { } while(0) diff --git a/arch/arm64/include/asm/cpu_ops.h b/arch/arm64/include/asm/cpu_ops.h index 5a31d6716914..8f03446cf89f 100644 --- a/arch/arm64/include/asm/cpu_ops.h +++ b/arch/arm64/include/asm/cpu_ops.h @@ -19,15 +19,15 @@ #include <linux/init.h> #include <linux/threads.h> -struct device_node; - /** * struct cpu_operations - Callback operations for hotplugging CPUs. * * @name: Name of the property as appears in a devicetree cpu node's - * enable-method property. - * @cpu_init: Reads any data necessary for a specific enable-method from the - * devicetree, for a given cpu node and proposed logical id. + * enable-method property. On systems booting with ACPI, @name + * identifies the struct cpu_operations entry corresponding to + * the boot protocol specified in the ACPI MADT table. + * @cpu_init: Reads any data necessary for a specific enable-method for a + * proposed logical id. * @cpu_prepare: Early one-time preparation step for a cpu. If there is a * mechanism for doing so, tests whether it is possible to boot * the given CPU. @@ -40,15 +40,15 @@ struct device_node; * @cpu_die: Makes a cpu leave the kernel. Must not fail. Called from the * cpu being killed. * @cpu_kill: Ensures a cpu has left the kernel. Called from another cpu. - * @cpu_init_idle: Reads any data necessary to initialize CPU idle states from - * devicetree, for a given cpu node and proposed logical id. + * @cpu_init_idle: Reads any data necessary to initialize CPU idle states for + * a proposed logical id. * @cpu_suspend: Suspends a cpu and saves the required context. May fail owing * to wrong parameters or error conditions. Called from the * CPU being suspended. Must be called with IRQs disabled. */ struct cpu_operations { const char *name; - int (*cpu_init)(struct device_node *, unsigned int); + int (*cpu_init)(unsigned int); int (*cpu_prepare)(unsigned int); int (*cpu_boot)(unsigned int); void (*cpu_postboot)(void); @@ -58,14 +58,17 @@ struct cpu_operations { int (*cpu_kill)(unsigned int cpu); #endif #ifdef CONFIG_CPU_IDLE - int (*cpu_init_idle)(struct device_node *, unsigned int); + int (*cpu_init_idle)(unsigned int); int (*cpu_suspend)(unsigned long); #endif }; extern const struct cpu_operations *cpu_ops[NR_CPUS]; -int __init cpu_read_ops(struct device_node *dn, int cpu); -void __init cpu_read_bootcpu_ops(void); -const struct cpu_operations *cpu_get_ops(const char *name); +int __init cpu_read_ops(int cpu); + +static inline void __init cpu_read_bootcpu_ops(void) +{ + cpu_read_ops(0); +} #endif /* ifndef __ASM_CPU_OPS_H */ diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h index 82cb9f98ba1a..d71140b76773 100644 --- a/arch/arm64/include/asm/cpufeature.h +++ b/arch/arm64/include/asm/cpufeature.h @@ -24,8 +24,10 @@ #define ARM64_WORKAROUND_CLEAN_CACHE 0 #define ARM64_WORKAROUND_DEVICE_LOAD_ACQUIRE 1 #define ARM64_WORKAROUND_845719 2 +#define ARM64_HAS_SYSREG_GIC_CPUIF 3 +#define ARM64_HAS_PAN 4 -#define ARM64_NCAPS 3 +#define ARM64_NCAPS 5 #ifndef __ASSEMBLY__ @@ -33,11 +35,17 @@ struct arm64_cpu_capabilities { const char *desc; u16 capability; bool (*matches)(const struct arm64_cpu_capabilities *); + void (*enable)(void); union { struct { /* To be used for erratum handling only */ u32 midr_model; u32 midr_range_min, midr_range_max; }; + + struct { /* Feature register checking */ + int field_pos; + int min_field_value; + }; }; }; @@ -64,6 +72,13 @@ static inline void cpus_set_cap(unsigned int num) __set_bit(num, cpu_hwcaps); } +static inline int __attribute_const__ cpuid_feature_extract_field(u64 features, + int field) +{ + return (s64)(features << (64 - 4 - field)) >> (64 - 4); +} + + void check_cpu_capabilities(const struct arm64_cpu_capabilities *caps, const char *info); void check_local_cpu_errata(void); diff --git a/arch/arm64/include/asm/cputype.h b/arch/arm64/include/asm/cputype.h index a84ec605bed8..ee6403df9fe4 100644 --- a/arch/arm64/include/asm/cputype.h +++ b/arch/arm64/include/asm/cputype.h @@ -81,9 +81,6 @@ #define ID_AA64MMFR0_BIGEND(mmfr0) \ (((mmfr0) & ID_AA64MMFR0_BIGEND_MASK) >> ID_AA64MMFR0_BIGEND_SHIFT) -#define SCTLR_EL1_CP15BEN (0x1 << 5) -#define SCTLR_EL1_SED (0x1 << 8) - #ifndef __ASSEMBLY__ /* diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h index 5f750dc96e0f..667346273d9b 100644 --- a/arch/arm64/include/asm/futex.h +++ b/arch/arm64/include/asm/futex.h @@ -20,10 +20,16 @@ #include <linux/futex.h> #include <linux/uaccess.h> + +#include <asm/alternative.h> +#include <asm/cpufeature.h> #include <asm/errno.h> +#include <asm/sysreg.h> #define __futex_atomic_op(insn, ret, oldval, uaddr, tmp, oparg) \ asm volatile( \ + ALTERNATIVE("nop", SET_PSTATE_PAN(0), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) \ "1: ldxr %w1, %2\n" \ insn "\n" \ "2: stlxr %w3, %w0, %2\n" \ @@ -39,6 +45,8 @@ " .align 3\n" \ " .quad 1b, 4b, 2b, 4b\n" \ " .popsection\n" \ + ALTERNATIVE("nop", SET_PSTATE_PAN(1), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) \ : "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp) \ : "r" (oparg), "Ir" (-EFAULT) \ : "memory") diff --git a/arch/arm64/include/asm/hardirq.h b/arch/arm64/include/asm/hardirq.h index 6aae421f4d73..2bb7009bdac7 100644 --- a/arch/arm64/include/asm/hardirq.h +++ b/arch/arm64/include/asm/hardirq.h @@ -24,9 +24,7 @@ typedef struct { unsigned int __softirq_pending; -#ifdef CONFIG_SMP unsigned int ipi_irqs[NR_IPI]; -#endif } ____cacheline_aligned irq_cpustat_t; #include <linux/irq_cpustat.h> /* Standard mappings for irq_cpustat_t above */ @@ -34,10 +32,8 @@ typedef struct { #define __inc_irq_stat(cpu, member) __IRQ_STAT(cpu, member)++ #define __get_irq_stat(cpu, member) __IRQ_STAT(cpu, member) -#ifdef CONFIG_SMP u64 smp_irq_stat_cpu(unsigned int cpu); #define arch_irq_stat_cpu smp_irq_stat_cpu -#endif #define __ARCH_IRQ_EXIT_IRQS_DISABLED 1 diff --git a/arch/arm64/include/asm/irq_work.h b/arch/arm64/include/asm/irq_work.h index b4f6b19a8a68..8e24ef3f7c82 100644 --- a/arch/arm64/include/asm/irq_work.h +++ b/arch/arm64/include/asm/irq_work.h @@ -1,8 +1,6 @@ #ifndef __ASM_IRQ_WORK_H #define __ASM_IRQ_WORK_H -#ifdef CONFIG_SMP - #include <asm/smp.h> static inline bool arch_irq_work_has_interrupt(void) @@ -10,13 +8,4 @@ static inline bool arch_irq_work_has_interrupt(void) return !!__smp_cross_call; } -#else - -static inline bool arch_irq_work_has_interrupt(void) -{ - return false; -} - -#endif - #endif /* __ASM_IRQ_WORK_H */ diff --git a/arch/arm64/include/asm/percpu.h b/arch/arm64/include/asm/percpu.h index 4fde8c1df97f..0a456bef8c79 100644 --- a/arch/arm64/include/asm/percpu.h +++ b/arch/arm64/include/asm/percpu.h @@ -16,8 +16,6 @@ #ifndef __ASM_PERCPU_H #define __ASM_PERCPU_H -#ifdef CONFIG_SMP - static inline void set_my_cpu_offset(unsigned long off) { asm volatile("msr tpidr_el1, %0" :: "r" (off) : "memory"); @@ -38,12 +36,6 @@ static inline unsigned long __my_cpu_offset(void) } #define __my_cpu_offset __my_cpu_offset() -#else /* !CONFIG_SMP */ - -#define set_my_cpu_offset(x) do { } while (0) - -#endif /* CONFIG_SMP */ - #define PERCPU_OP(op, asm_op) \ static inline unsigned long __percpu_##op(void *ptr, \ unsigned long val, int size) \ diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index cf7319422768..f35ac429d42b 100644 --- a/arch/arm64/include/asm/pgtable.h +++ b/arch/arm64/include/asm/pgtable.h @@ -53,13 +53,8 @@ extern void __pmd_error(const char *file, int line, unsigned long val); extern void __pud_error(const char *file, int line, unsigned long val); extern void __pgd_error(const char *file, int line, unsigned long val); -#ifdef CONFIG_SMP #define PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF | PTE_SHARED) #define PROT_SECT_DEFAULT (PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S) -#else -#define PROT_DEFAULT (PTE_TYPE_PAGE | PTE_AF) -#define PROT_SECT_DEFAULT (PMD_TYPE_SECT | PMD_SECT_AF) -#endif #define PROT_DEVICE_nGnRE (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_ATTRINDX(MT_DEVICE_nGnRE)) #define PROT_NORMAL_NC (PROT_DEFAULT | PTE_PXN | PTE_UXN | PTE_ATTRINDX(MT_NORMAL_NC)) diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h index d2c37a1df0eb..6c2f5726fe0b 100644 --- a/arch/arm64/include/asm/processor.h +++ b/arch/arm64/include/asm/processor.h @@ -169,4 +169,6 @@ static inline void spin_lock_prefetch(const void *x) #endif +void cpu_enable_pan(void); + #endif /* __ASM_PROCESSOR_H */ diff --git a/arch/arm64/include/asm/psci.h b/arch/arm64/include/asm/psci.h deleted file mode 100644 index 2454bc59c916..000000000000 --- a/arch/arm64/include/asm/psci.h +++ /dev/null @@ -1,20 +0,0 @@ -/* - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU General Public License version 2 as - * published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU General Public License for more details. - * - * Copyright (C) 2013 ARM Limited - */ - -#ifndef __ASM_PSCI_H -#define __ASM_PSCI_H - -int psci_dt_init(void); -int psci_acpi_init(void); - -#endif /* __ASM_PSCI_H */ diff --git a/arch/arm64/include/asm/ptrace.h b/arch/arm64/include/asm/ptrace.h index d4264bb0a409..e9e5467e0bf4 100644 --- a/arch/arm64/include/asm/ptrace.h +++ b/arch/arm64/include/asm/ptrace.h @@ -183,11 +183,7 @@ static inline int valid_user_regs(struct user_pt_regs *regs) #define instruction_pointer(regs) ((unsigned long)(regs)->pc) -#ifdef CONFIG_SMP extern unsigned long profile_pc(struct pt_regs *regs); -#else -#define profile_pc(regs) instruction_pointer(regs) -#endif #endif /* __ASSEMBLY__ */ #endif diff --git a/arch/arm64/include/asm/smp.h b/arch/arm64/include/asm/smp.h index bf22650b1a78..d9c3d6a6100a 100644 --- a/arch/arm64/include/asm/smp.h +++ b/arch/arm64/include/asm/smp.h @@ -20,10 +20,6 @@ #include <linux/cpumask.h> #include <linux/thread_info.h> -#ifndef CONFIG_SMP -# error "<asm/smp.h> included in non-SMP build" -#endif - #define raw_smp_processor_id() (current_thread_info()->cpu) struct seq_file; @@ -42,7 +38,7 @@ extern void handle_IPI(int ipinr, struct pt_regs *regs); * Discover the set of possible CPUs and determine their * SMP operations. */ -extern void of_smp_init_cpus(void); +extern void smp_init_cpus(void); /* * Provide a function to raise an IPI cross call on CPUs in callmap. diff --git a/arch/arm64/include/asm/smp_plat.h b/arch/arm64/include/asm/smp_plat.h index 8dcd61e32176..7abf7570c00f 100644 --- a/arch/arm64/include/asm/smp_plat.h +++ b/arch/arm64/include/asm/smp_plat.h @@ -19,6 +19,8 @@ #ifndef __ASM_SMP_PLAT_H #define __ASM_SMP_PLAT_H +#include <linux/cpumask.h> + #include <asm/types.h> struct mpidr_hash { @@ -39,6 +41,20 @@ static inline u32 mpidr_hash_size(void) */ extern u64 __cpu_logical_map[NR_CPUS]; #define cpu_logical_map(cpu) __cpu_logical_map[cpu] +/* + * Retrieve logical cpu index corresponding to a given MPIDR.Aff* + * - mpidr: MPIDR.Aff* bits to be used for the look-up + * + * Returns the cpu logical index or -EINVAL on look-up error + */ +static inline int get_logical_index(u64 mpidr) +{ + int cpu; + for (cpu = 0; cpu < nr_cpu_ids; cpu++) + if (cpu_logical_map(cpu) == mpidr) + return cpu; + return -EINVAL; +} void __init do_post_cpus_up_work(void); diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h index 5c89df0acbcb..a7f3d4b2514d 100644 --- a/arch/arm64/include/asm/sysreg.h +++ b/arch/arm64/include/asm/sysreg.h @@ -20,8 +20,29 @@ #ifndef __ASM_SYSREG_H #define __ASM_SYSREG_H +#include <asm/opcodes.h> + +#define SCTLR_EL1_CP15BEN (0x1 << 5) +#define SCTLR_EL1_SED (0x1 << 8) + +/* + * ARMv8 ARM reserves the following encoding for system registers: + * (Ref: ARMv8 ARM, Section: "System instruction class encoding overview", + * C5.2, version:ARM DDI 0487A.f) + * [20-19] : Op0 + * [18-16] : Op1 + * [15-12] : CRn + * [11-8] : CRm + * [7-5] : Op2 + */ #define sys_reg(op0, op1, crn, crm, op2) \ - ((((op0)-2)<<19)|((op1)<<16)|((crn)<<12)|((crm)<<8)|((op2)<<5)) + ((((op0)&3)<<19)|((op1)<<16)|((crn)<<12)|((crm)<<8)|((op2)<<5)) + +#define REG_PSTATE_PAN_IMM sys_reg(0, 0, 4, 0, 4) +#define SCTLR_EL1_SPAN (1 << 23) + +#define SET_PSTATE_PAN(x) __inst_arm(0xd5000000 | REG_PSTATE_PAN_IMM |\ + (!!x)<<8 | 0x1f) #ifdef __ASSEMBLY__ @@ -31,11 +52,11 @@ .equ __reg_num_xzr, 31 .macro mrs_s, rt, sreg - .inst 0xd5300000|(\sreg)|(__reg_num_\rt) + .inst 0xd5200000|(\sreg)|(__reg_num_\rt) .endm .macro msr_s, sreg, rt - .inst 0xd5100000|(\sreg)|(__reg_num_\rt) + .inst 0xd5000000|(\sreg)|(__reg_num_\rt) .endm #else @@ -47,14 +68,23 @@ asm( " .equ __reg_num_xzr, 31\n" "\n" " .macro mrs_s, rt, sreg\n" -" .inst 0xd5300000|(\\sreg)|(__reg_num_\\rt)\n" +" .inst 0xd5200000|(\\sreg)|(__reg_num_\\rt)\n" " .endm\n" "\n" " .macro msr_s, sreg, rt\n" -" .inst 0xd5100000|(\\sreg)|(__reg_num_\\rt)\n" +" .inst 0xd5000000|(\\sreg)|(__reg_num_\\rt)\n" " .endm\n" ); +static inline void config_sctlr_el1(u32 clear, u32 set) +{ + u32 val; + + asm volatile("mrs %0, sctlr_el1" : "=r" (val)); + val &= ~clear; + val |= set; + asm volatile("msr sctlr_el1, %0" : : "r" (val)); +} #endif #endif /* __ASM_SYSREG_H */ diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h index 7ebcd31ce51c..65e9c4615664 100644 --- a/arch/arm64/include/asm/topology.h +++ b/arch/arm64/include/asm/topology.h @@ -1,8 +1,6 @@ #ifndef __ASM_TOPOLOGY_H #define __ASM_TOPOLOGY_H -#ifdef CONFIG_SMP - #include <linux/cpumask.h> struct cpu_topology { @@ -24,13 +22,6 @@ void init_cpu_topology(void); void store_cpu_topology(unsigned int cpuid); const struct cpumask *cpu_coregroup_mask(int cpu); -#else - -static inline void init_cpu_topology(void) { } -static inline void store_cpu_topology(unsigned int cpuid) { } - -#endif - #include <asm-generic/topology.h> #endif /* _ASM_ARM_TOPOLOGY_H */ diff --git a/arch/arm64/include/asm/uaccess.h b/arch/arm64/include/asm/uaccess.h index 07e1ba449bf1..b2ede967fe7d 100644 --- a/arch/arm64/include/asm/uaccess.h +++ b/arch/arm64/include/asm/uaccess.h @@ -24,7 +24,10 @@ #include <linux/string.h> #include <linux/thread_info.h> +#include <asm/alternative.h> +#include <asm/cpufeature.h> #include <asm/ptrace.h> +#include <asm/sysreg.h> #include <asm/errno.h> #include <asm/memory.h> #include <asm/compiler.h> @@ -131,6 +134,8 @@ static inline void set_fs(mm_segment_t fs) do { \ unsigned long __gu_val; \ __chk_user_ptr(ptr); \ + asm(ALTERNATIVE("nop", SET_PSTATE_PAN(0), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN)); \ switch (sizeof(*(ptr))) { \ case 1: \ __get_user_asm("ldrb", "%w", __gu_val, (ptr), (err)); \ @@ -148,6 +153,8 @@ do { \ BUILD_BUG(); \ } \ (x) = (__force __typeof__(*(ptr)))__gu_val; \ + asm(ALTERNATIVE("nop", SET_PSTATE_PAN(1), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN)); \ } while (0) #define __get_user(x, ptr) \ @@ -194,6 +201,8 @@ do { \ do { \ __typeof__(*(ptr)) __pu_val = (x); \ __chk_user_ptr(ptr); \ + asm(ALTERNATIVE("nop", SET_PSTATE_PAN(0), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN)); \ switch (sizeof(*(ptr))) { \ case 1: \ __put_user_asm("strb", "%w", __pu_val, (ptr), (err)); \ @@ -210,6 +219,8 @@ do { \ default: \ BUILD_BUG(); \ } \ + asm(ALTERNATIVE("nop", SET_PSTATE_PAN(1), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN)); \ } while (0) #define __put_user(x, ptr) \ diff --git a/arch/arm64/include/uapi/asm/ptrace.h b/arch/arm64/include/uapi/asm/ptrace.h index 6913643bbe54..208db3df135a 100644 --- a/arch/arm64/include/uapi/asm/ptrace.h +++ b/arch/arm64/include/uapi/asm/ptrace.h @@ -44,6 +44,7 @@ #define PSR_I_BIT 0x00000080 #define PSR_A_BIT 0x00000100 #define PSR_D_BIT 0x00000200 +#define PSR_PAN_BIT 0x00400000 #define PSR_Q_BIT 0x08000000 #define PSR_V_BIT 0x10000000 #define PSR_C_BIT 0x20000000 diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 426d0763c81b..c24f5718d484 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -17,7 +17,8 @@ arm64-obj-y := debug-monitors.o entry.o irq.o fpsimd.o \ sys.o stacktrace.o time.o traps.o io.o vdso.o \ hyp-stub.o psci.o psci-call.o cpu_ops.o insn.o \ return_address.o cpuinfo.o cpu_errata.o \ - cpufeature.o alternative.o cacheinfo.o + cpufeature.o alternative.o cacheinfo.o \ + smp.o smp_spin_table.o topology.o arm64-obj-$(CONFIG_COMPAT) += sys32.o kuser32.o signal32.o \ sys_compat.o entry32.o \ diff --git a/arch/arm64/kernel/acpi.c b/arch/arm64/kernel/acpi.c index 8b839558838e..19de7537e7d3 100644 --- a/arch/arm64/kernel/acpi.c +++ b/arch/arm64/kernel/acpi.c @@ -36,12 +36,6 @@ EXPORT_SYMBOL(acpi_disabled); int acpi_pci_disabled = 1; /* skip ACPI PCI scan and IRQ initialization */ EXPORT_SYMBOL(acpi_pci_disabled); -/* Processors with enabled flag and sane MPIDR */ -static int enabled_cpus; - -/* Boot CPU is valid or not in MADT */ -static bool bootcpu_valid __initdata; - static bool param_acpi_off __initdata; static bool param_acpi_force __initdata; @@ -95,122 +89,15 @@ void __init __acpi_unmap_table(char *map, unsigned long size) early_memunmap(map, size); } -/** - * acpi_map_gic_cpu_interface - generates a logical cpu number - * and map to MPIDR represented by GICC structure - */ -static void __init -acpi_map_gic_cpu_interface(struct acpi_madt_generic_interrupt *processor) +bool __init acpi_psci_present(void) { - int i; - u64 mpidr = processor->arm_mpidr & MPIDR_HWID_BITMASK; - bool enabled = !!(processor->flags & ACPI_MADT_ENABLED); - - if (mpidr == INVALID_HWID) { - pr_info("Skip MADT cpu entry with invalid MPIDR\n"); - return; - } - - total_cpus++; - if (!enabled) - return; - - if (enabled_cpus >= NR_CPUS) { - pr_warn("NR_CPUS limit of %d reached, Processor %d/0x%llx ignored.\n", - NR_CPUS, total_cpus, mpidr); - return; - } - - /* Check if GICC structure of boot CPU is available in the MADT */ - if (cpu_logical_map(0) == mpidr) { - if (bootcpu_valid) { - pr_err("Firmware bug, duplicate CPU MPIDR: 0x%llx in MADT\n", - mpidr); - return; - } - - bootcpu_valid = true; - } - - /* - * Duplicate MPIDRs are a recipe for disaster. Scan - * all initialized entries and check for - * duplicates. If any is found just ignore the CPU. - */ - for (i = 1; i < enabled_cpus; i++) { - if (cpu_logical_map(i) == mpidr) { - pr_err("Firmware bug, duplicate CPU MPIDR: 0x%llx in MADT\n", - mpidr); - return; - } - } - - if (!acpi_psci_present()) - return; - - cpu_ops[enabled_cpus] = cpu_get_ops("psci"); - /* CPU 0 was already initialized */ - if (enabled_cpus) { - if (!cpu_ops[enabled_cpus]) - return; - - if (cpu_ops[enabled_cpus]->cpu_init(NULL, enabled_cpus)) - return; - - /* map the logical cpu id to cpu MPIDR */ - cpu_logical_map(enabled_cpus) = mpidr; - } - - enabled_cpus++; + return acpi_gbl_FADT.arm_boot_flags & ACPI_FADT_PSCI_COMPLIANT; } -static int __init -acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header, - const unsigned long end) +/* Whether HVC must be used instead of SMC as the PSCI conduit */ +bool __init acpi_psci_use_hvc(void) { - struct acpi_madt_generic_interrupt *processor; - - processor = (struct acpi_madt_generic_interrupt *)header; - - if (BAD_MADT_ENTRY(processor, end)) - return -EINVAL; - - acpi_table_print_madt_entry(header); - acpi_map_gic_cpu_interface(processor); - return 0; -} - -/* Parse GIC cpu interface entries in MADT for SMP init */ -void __init acpi_init_cpus(void) -{ - int count, i; - - /* - * do a partial walk of MADT to determine how many CPUs - * we have including disabled CPUs, and get information - * we need for SMP init - */ - count = acpi_table_parse_madt(ACPI_MADT_TYPE_GENERIC_INTERRUPT, - acpi_parse_gic_cpu_interface, 0); - - if (!count) { - pr_err("No GIC CPU interface entries present\n"); - return; - } else if (count < 0) { - pr_err("Error parsing GIC CPU interface entry\n"); - return; - } - - if (!bootcpu_valid) { - pr_err("MADT missing boot CPU MPIDR, not enabling secondaries\n"); - return; - } - - for (i = 0; i < enabled_cpus; i++) - set_cpu_possible(i, true); - - /* Make boot-up look pretty */ - pr_info("%d CPUs enabled, %d CPUs total\n", enabled_cpus, total_cpus); + return acpi_gbl_FADT.arm_boot_flags & ACPI_FADT_PSCI_USE_HVC; } /* diff --git a/arch/arm64/kernel/armv8_deprecated.c b/arch/arm64/kernel/armv8_deprecated.c index 7ac3920b1356..937f5e58a4d3 100644 --- a/arch/arm64/kernel/armv8_deprecated.c +++ b/arch/arm64/kernel/armv8_deprecated.c @@ -14,8 +14,11 @@ #include <linux/slab.h> #include <linux/sysctl.h> +#include <asm/alternative.h> +#include <asm/cpufeature.h> #include <asm/insn.h> #include <asm/opcodes.h> +#include <asm/sysreg.h> #include <asm/system_misc.h> #include <asm/traps.h> #include <asm/uaccess.h> @@ -279,6 +282,8 @@ static void register_insn_emulation_sysctl(struct ctl_table *table) */ #define __user_swpX_asm(data, addr, res, temp, B) \ __asm__ __volatile__( \ + ALTERNATIVE("nop", SET_PSTATE_PAN(0), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) \ "0: ldxr"B" %w2, [%3]\n" \ "1: stxr"B" %w0, %w1, [%3]\n" \ " cbz %w0, 2f\n" \ @@ -297,6 +302,8 @@ static void register_insn_emulation_sysctl(struct ctl_table *table) " .quad 0b, 4b\n" \ " .quad 1b, 4b\n" \ " .popsection\n" \ + ALTERNATIVE("nop", SET_PSTATE_PAN(1), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) \ : "=&r" (res), "+r" (data), "=&r" (temp) \ : "r" (addr), "i" (-EAGAIN), "i" (-EFAULT) \ : "memory") @@ -506,16 +513,6 @@ ret: return 0; } -static inline void config_sctlr_el1(u32 clear, u32 set) -{ - u32 val; - - asm volatile("mrs %0, sctlr_el1" : "=r" (val)); - val &= ~clear; - val |= set; - asm volatile("msr sctlr_el1, %0" : : "r" (val)); -} - static int cp15_barrier_set_hw_mode(bool enable) { if (enable) diff --git a/arch/arm64/kernel/cpu_ops.c b/arch/arm64/kernel/cpu_ops.c index fb8ff9ba467a..b6bd7d447768 100644 --- a/arch/arm64/kernel/cpu_ops.c +++ b/arch/arm64/kernel/cpu_ops.c @@ -16,11 +16,13 @@ * along with this program. If not, see <http://www.gnu.org/licenses/>. */ -#include <asm/cpu_ops.h> -#include <asm/smp_plat.h> +#include <linux/acpi.h> #include <linux/errno.h> #include <linux/of.h> #include <linux/string.h> +#include <asm/acpi.h> +#include <asm/cpu_ops.h> +#include <asm/smp_plat.h> extern const struct cpu_operations smp_spin_table_ops; extern const struct cpu_operations cpu_psci_ops; @@ -28,14 +30,12 @@ extern const struct cpu_operations cpu_psci_ops; const struct cpu_operations *cpu_ops[NR_CPUS]; static const struct cpu_operations *supported_cpu_ops[] __initconst = { -#ifdef CONFIG_SMP &smp_spin_table_ops, -#endif &cpu_psci_ops, NULL, }; -const struct cpu_operations * __init cpu_get_ops(const char *name) +static const struct cpu_operations * __init cpu_get_ops(const char *name) { const struct cpu_operations **ops = supported_cpu_ops; @@ -49,39 +49,53 @@ const struct cpu_operations * __init cpu_get_ops(const char *name) return NULL; } +static const char *__init cpu_read_enable_method(int cpu) +{ + const char *enable_method; + + if (acpi_disabled) { + struct device_node *dn = of_get_cpu_node(cpu, NULL); + + if (!dn) { + if (!cpu) + pr_err("Failed to find device node for boot cpu\n"); + return NULL; + } + + enable_method = of_get_property(dn, "enable-method", NULL); + if (!enable_method) { + /* + * The boot CPU may not have an enable method (e.g. + * when spin-table is used for secondaries). + * Don't warn spuriously. + */ + if (cpu != 0) + pr_err("%s: missing enable-method property\n", + dn->full_name); + } + } else { + enable_method = acpi_get_enable_method(cpu); + if (!enable_method) + pr_err("Unsupported ACPI enable-method\n"); + } + + return enable_method; +} /* - * Read a cpu's enable method from the device tree and record it in cpu_ops. + * Read a cpu's enable method and record it in cpu_ops. */ -int __init cpu_read_ops(struct device_node *dn, int cpu) +int __init cpu_read_ops(int cpu) { - const char *enable_method = of_get_property(dn, "enable-method", NULL); - if (!enable_method) { - /* - * The boot CPU may not have an enable method (e.g. when - * spin-table is used for secondaries). Don't warn spuriously. - */ - if (cpu != 0) - pr_err("%s: missing enable-method property\n", - dn->full_name); - return -ENOENT; - } + const char *enable_method = cpu_read_enable_method(cpu); + + if (!enable_method) + return -ENODEV; cpu_ops[cpu] = cpu_get_ops(enable_method); if (!cpu_ops[cpu]) { - pr_warn("%s: unsupported enable-method property: %s\n", - dn->full_name, enable_method); + pr_warn("Unsupported enable-method: %s\n", enable_method); return -EOPNOTSUPP; } return 0; } - -void __init cpu_read_bootcpu_ops(void) -{ - struct device_node *dn = of_get_cpu_node(0, NULL); - if (!dn) { - pr_err("Failed to find device node for boot cpu\n"); - return; - } - cpu_read_ops(dn, 0); -} diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c index 3d9967e43d89..978fa169d3c3 100644 --- a/arch/arm64/kernel/cpufeature.c +++ b/arch/arm64/kernel/cpufeature.c @@ -21,8 +21,52 @@ #include <linux/types.h> #include <asm/cpu.h> #include <asm/cpufeature.h> +#include <asm/processor.h> + +static bool +feature_matches(u64 reg, const struct arm64_cpu_capabilities *entry) +{ + int val = cpuid_feature_extract_field(reg, entry->field_pos); + + return val >= entry->min_field_value; +} + +static bool +has_id_aa64pfr0_feature(const struct arm64_cpu_capabilities *entry) +{ + u64 val; + + val = read_cpuid(id_aa64pfr0_el1); + return feature_matches(val, entry); +} + +static bool __maybe_unused +has_id_aa64mmfr1_feature(const struct arm64_cpu_capabilities *entry) +{ + u64 val; + + val = read_cpuid(id_aa64mmfr1_el1); + return feature_matches(val, entry); +} static const struct arm64_cpu_capabilities arm64_features[] = { + { + .desc = "GIC system register CPU interface", + .capability = ARM64_HAS_SYSREG_GIC_CPUIF, + .matches = has_id_aa64pfr0_feature, + .field_pos = 24, + .min_field_value = 1, + }, +#ifdef CONFIG_ARM64_PAN + { + .desc = "Privileged Access Never", + .capability = ARM64_HAS_PAN, + .matches = has_id_aa64mmfr1_feature, + .field_pos = 20, + .min_field_value = 1, + .enable = cpu_enable_pan, + }, +#endif /* CONFIG_ARM64_PAN */ {}, }; @@ -39,6 +83,12 @@ void check_cpu_capabilities(const struct arm64_cpu_capabilities *caps, pr_info("%s %s\n", info, caps[i].desc); cpus_set_cap(caps[i].capability); } + + /* second pass allows enable() to consider interacting capabilities */ + for (i = 0; caps[i].desc; i++) { + if (cpus_have_cap(caps[i].capability) && caps[i].enable) + caps[i].enable(); + } } void check_local_cpu_features(void) diff --git a/arch/arm64/kernel/cpuidle.c b/arch/arm64/kernel/cpuidle.c index 2bbd0fee084f..7ce589ca54a4 100644 --- a/arch/arm64/kernel/cpuidle.c +++ b/arch/arm64/kernel/cpuidle.c @@ -18,15 +18,10 @@ int arm_cpuidle_init(unsigned int cpu) { int ret = -EOPNOTSUPP; - struct device_node *cpu_node = of_cpu_device_node_get(cpu); - - if (!cpu_node) - return -ENODEV; if (cpu_ops[cpu] && cpu_ops[cpu]->cpu_init_idle) - ret = cpu_ops[cpu]->cpu_init_idle(cpu_node, cpu); + ret = cpu_ops[cpu]->cpu_init_idle(cpu); - of_node_put(cpu_node); return ret; } diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S index bddd04d031db..3661b12d9b26 100644 --- a/arch/arm64/kernel/entry.S +++ b/arch/arm64/kernel/entry.S @@ -21,7 +21,7 @@ #include <linux/init.h> #include <linux/linkage.h> -#include <asm/alternative-asm.h> +#include <asm/alternative.h> #include <asm/assembler.h> #include <asm/asm-offsets.h> #include <asm/cpufeature.h> diff --git a/arch/arm64/kernel/head.S b/arch/arm64/kernel/head.S index cc7435c9676e..699444ed7d86 100644 --- a/arch/arm64/kernel/head.S +++ b/arch/arm64/kernel/head.S @@ -62,13 +62,8 @@ /* * Initial memory map attributes. */ -#ifndef CONFIG_SMP -#define PTE_FLAGS PTE_TYPE_PAGE | PTE_AF -#define PMD_FLAGS PMD_TYPE_SECT | PMD_SECT_AF -#else #define PTE_FLAGS PTE_TYPE_PAGE | PTE_AF | PTE_SHARED #define PMD_FLAGS PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S -#endif #ifdef CONFIG_ARM64_64K_PAGES #define MM_MMUFLAGS PTE_ATTRINDX(MT_NORMAL) | PTE_FLAGS @@ -382,7 +377,7 @@ __create_page_tables: * Create the identity mapping. */ mov x0, x25 // idmap_pg_dir - adrp x3, KERNEL_START // __pa(KERNEL_START) + adrp x3, __idmap_text_start // __pa(__idmap_text_start) #ifndef CONFIG_ARM64_VA_BITS_48 #define EXTRA_SHIFT (PGDIR_SHIFT + PAGE_SHIFT - 3) @@ -405,11 +400,11 @@ __create_page_tables: /* * Calculate the maximum allowed value for TCR_EL1.T0SZ so that the - * entire kernel image can be ID mapped. As T0SZ == (64 - #bits used), + * entire ID map region can be mapped. As T0SZ == (64 - #bits used), * this number conveniently equals the number of leading zeroes in - * the physical address of KERNEL_END. + * the physical address of __idmap_text_end. */ - adrp x5, KERNEL_END + adrp x5, __idmap_text_end clz x5, x5 cmp x5, TCR_T0SZ(VA_BITS) // default T0SZ small enough? b.ge 1f // .. then skip additional level @@ -424,8 +419,8 @@ __create_page_tables: #endif create_pgd_entry x0, x3, x5, x6 - mov x5, x3 // __pa(KERNEL_START) - adr_l x6, KERNEL_END // __pa(KERNEL_END) + mov x5, x3 // __pa(__idmap_text_start) + adr_l x6, __idmap_text_end // __pa(__idmap_text_end) create_block_map x0, x7, x3, x5, x6 /* @@ -621,7 +616,6 @@ ENTRY(__boot_cpu_mode) .long BOOT_CPU_MODE_EL1 .popsection -#ifdef CONFIG_SMP /* * This provides a "holding pen" for platforms to hold all secondary * cores are held until we're ready for them to initialise. @@ -669,7 +663,6 @@ ENTRY(__secondary_switched) mov x29, #0 b secondary_start_kernel ENDPROC(__secondary_switched) -#endif /* CONFIG_SMP */ /* * Enable the MMU. @@ -679,6 +672,7 @@ ENDPROC(__secondary_switched) * * other registers depend on the function called upon completion */ + .section ".idmap.text", "ax" __enable_mmu: ldr x5, =vectors msr vbar_el1, x5 diff --git a/arch/arm64/kernel/irq.c b/arch/arm64/kernel/irq.c index 240b75c0e94f..de99d4bd31bb 100644 --- a/arch/arm64/kernel/irq.c +++ b/arch/arm64/kernel/irq.c @@ -33,9 +33,7 @@ unsigned long irq_err_count; int arch_show_interrupts(struct seq_file *p, int prec) { -#ifdef CONFIG_SMP show_ipi_list(p, prec); -#endif seq_printf(p, "%*s: %10lu\n", prec, "Err", irq_err_count); return 0; } diff --git a/arch/arm64/kernel/psci.c b/arch/arm64/kernel/psci.c index 24d4733b7e3c..f67f35b6edb1 100644 --- a/arch/arm64/kernel/psci.c +++ b/arch/arm64/kernel/psci.c @@ -15,183 +15,32 @@ #define pr_fmt(fmt) "psci: " fmt -#include <linux/acpi.h> #include <linux/init.h> #include <linux/of.h> #include <linux/smp.h> -#include <linux/reboot.h> -#include <linux/pm.h> #include <linux/delay.h> +#include <linux/psci.h> #include <linux/slab.h> + #include <uapi/linux/psci.h> -#include <asm/acpi.h> #include <asm/compiler.h> #include <asm/cpu_ops.h> #include <asm/errno.h> -#include <asm/psci.h> #include <asm/smp_plat.h> #include <asm/suspend.h> -#include <asm/system_misc.h> - -#define PSCI_POWER_STATE_TYPE_STANDBY 0 -#define PSCI_POWER_STATE_TYPE_POWER_DOWN 1 - -struct psci_power_state { - u16 id; - u8 type; - u8 affinity_level; -}; - -struct psci_operations { - int (*cpu_suspend)(struct psci_power_state state, - unsigned long entry_point); - int (*cpu_off)(struct psci_power_state state); - int (*cpu_on)(unsigned long cpuid, unsigned long entry_point); - int (*migrate)(unsigned long cpuid); - int (*affinity_info)(unsigned long target_affinity, - unsigned long lowest_affinity_level); - int (*migrate_info_type)(void); -}; - -static struct psci_operations psci_ops; - -static int (*invoke_psci_fn)(u64, u64, u64, u64); -typedef int (*psci_initcall_t)(const struct device_node *); - -asmlinkage int __invoke_psci_fn_hvc(u64, u64, u64, u64); -asmlinkage int __invoke_psci_fn_smc(u64, u64, u64, u64); - -enum psci_function { - PSCI_FN_CPU_SUSPEND, - PSCI_FN_CPU_ON, - PSCI_FN_CPU_OFF, - PSCI_FN_MIGRATE, - PSCI_FN_AFFINITY_INFO, - PSCI_FN_MIGRATE_INFO_TYPE, - PSCI_FN_MAX, -}; - -static DEFINE_PER_CPU_READ_MOSTLY(struct psci_power_state *, psci_power_state); - -static u32 psci_function_id[PSCI_FN_MAX]; - -static int psci_to_linux_errno(int errno) -{ - switch (errno) { - case PSCI_RET_SUCCESS: - return 0; - case PSCI_RET_NOT_SUPPORTED: - return -EOPNOTSUPP; - case PSCI_RET_INVALID_PARAMS: - return -EINVAL; - case PSCI_RET_DENIED: - return -EPERM; - }; - - return -EINVAL; -} - -static u32 psci_power_state_pack(struct psci_power_state state) -{ - return ((state.id << PSCI_0_2_POWER_STATE_ID_SHIFT) - & PSCI_0_2_POWER_STATE_ID_MASK) | - ((state.type << PSCI_0_2_POWER_STATE_TYPE_SHIFT) - & PSCI_0_2_POWER_STATE_TYPE_MASK) | - ((state.affinity_level << PSCI_0_2_POWER_STATE_AFFL_SHIFT) - & PSCI_0_2_POWER_STATE_AFFL_MASK); -} - -static void psci_power_state_unpack(u32 power_state, - struct psci_power_state *state) -{ - state->id = (power_state & PSCI_0_2_POWER_STATE_ID_MASK) >> - PSCI_0_2_POWER_STATE_ID_SHIFT; - state->type = (power_state & PSCI_0_2_POWER_STATE_TYPE_MASK) >> - PSCI_0_2_POWER_STATE_TYPE_SHIFT; - state->affinity_level = - (power_state & PSCI_0_2_POWER_STATE_AFFL_MASK) >> - PSCI_0_2_POWER_STATE_AFFL_SHIFT; -} - -static int psci_get_version(void) -{ - int err; - - err = invoke_psci_fn(PSCI_0_2_FN_PSCI_VERSION, 0, 0, 0); - return err; -} - -static int psci_cpu_suspend(struct psci_power_state state, - unsigned long entry_point) -{ - int err; - u32 fn, power_state; - - fn = psci_function_id[PSCI_FN_CPU_SUSPEND]; - power_state = psci_power_state_pack(state); - err = invoke_psci_fn(fn, power_state, entry_point, 0); - return psci_to_linux_errno(err); -} -static int psci_cpu_off(struct psci_power_state state) -{ - int err; - u32 fn, power_state; +static DEFINE_PER_CPU_READ_MOSTLY(u32 *, psci_power_state); - fn = psci_function_id[PSCI_FN_CPU_OFF]; - power_state = psci_power_state_pack(state); - err = invoke_psci_fn(fn, power_state, 0, 0); - return psci_to_linux_errno(err); -} - -static int psci_cpu_on(unsigned long cpuid, unsigned long entry_point) -{ - int err; - u32 fn; - - fn = psci_function_id[PSCI_FN_CPU_ON]; - err = invoke_psci_fn(fn, cpuid, entry_point, 0); - return psci_to_linux_errno(err); -} - -static int psci_migrate(unsigned long cpuid) -{ - int err; - u32 fn; - - fn = psci_function_id[PSCI_FN_MIGRATE]; - err = invoke_psci_fn(fn, cpuid, 0, 0); - return psci_to_linux_errno(err); -} - -static int psci_affinity_info(unsigned long target_affinity, - unsigned long lowest_affinity_level) -{ - int err; - u32 fn; - - fn = psci_function_id[PSCI_FN_AFFINITY_INFO]; - err = invoke_psci_fn(fn, target_affinity, lowest_affinity_level, 0); - return err; -} - -static int psci_migrate_info_type(void) -{ - int err; - u32 fn; - - fn = psci_function_id[PSCI_FN_MIGRATE_INFO_TYPE]; - err = invoke_psci_fn(fn, 0, 0, 0); - return err; -} - -static int __maybe_unused cpu_psci_cpu_init_idle(struct device_node *cpu_node, - unsigned int cpu) +static int __maybe_unused cpu_psci_cpu_init_idle(unsigned int cpu) { int i, ret, count = 0; - struct psci_power_state *psci_states; - struct device_node *state_node; + u32 *psci_states; + struct device_node *state_node, *cpu_node; + + cpu_node = of_get_cpu_node(cpu, NULL); + if (!cpu_node) + return -ENODEV; /* * If the PSCI cpu_suspend function hook has not been initialized @@ -215,13 +64,13 @@ static int __maybe_unused cpu_psci_cpu_init_idle(struct device_node *cpu_node, return -ENOMEM; for (i = 0; i < count; i++) { - u32 psci_power_state; + u32 state; state_node = of_parse_phandle(cpu_node, "cpu-idle-states", i); ret = of_property_read_u32(state_node, "arm,psci-suspend-param", - &psci_power_state); + &state); if (ret) { pr_warn(" * %s missing arm,psci-suspend-param property\n", state_node->full_name); @@ -230,9 +79,13 @@ static int __maybe_unused cpu_psci_cpu_init_idle(struct device_node *cpu_node, } of_node_put(state_node); - pr_debug("psci-power-state %#x index %d\n", psci_power_state, - i); - psci_power_state_unpack(psci_power_state, &psci_states[i]); + pr_debug("psci-power-state %#x index %d\n", state, i); + if (!psci_power_state_is_valid(state)) { + pr_warn("Invalid PSCI power state %#x\n", state); + ret = -EINVAL; + goto free_mem; + } + psci_states[i] = state; } /* Idle states parsed correctly, initialize per-cpu pointer */ per_cpu(psci_power_state, cpu) = psci_states; @@ -243,208 +96,7 @@ free_mem: return ret; } -static int get_set_conduit_method(struct device_node *np) -{ - const char *method; - - pr_info("probing for conduit method from DT.\n"); - - if (of_property_read_string(np, "method", &method)) { - pr_warn("missing \"method\" property\n"); - return -ENXIO; - } - - if (!strcmp("hvc", method)) { - invoke_psci_fn = __invoke_psci_fn_hvc; - } else if (!strcmp("smc", method)) { - invoke_psci_fn = __invoke_psci_fn_smc; - } else { - pr_warn("invalid \"method\" property: %s\n", method); - return -EINVAL; - } - return 0; -} - -static void psci_sys_reset(enum reboot_mode reboot_mode, const char *cmd) -{ - invoke_psci_fn(PSCI_0_2_FN_SYSTEM_RESET, 0, 0, 0); -} - -static void psci_sys_poweroff(void) -{ - invoke_psci_fn(PSCI_0_2_FN_SYSTEM_OFF, 0, 0, 0); -} - -static void __init psci_0_2_set_functions(void) -{ - pr_info("Using standard PSCI v0.2 function IDs\n"); - psci_function_id[PSCI_FN_CPU_SUSPEND] = PSCI_0_2_FN64_CPU_SUSPEND; - psci_ops.cpu_suspend = psci_cpu_suspend; - - psci_function_id[PSCI_FN_CPU_OFF] = PSCI_0_2_FN_CPU_OFF; - psci_ops.cpu_off = psci_cpu_off; - - psci_function_id[PSCI_FN_CPU_ON] = PSCI_0_2_FN64_CPU_ON; - psci_ops.cpu_on = psci_cpu_on; - - psci_function_id[PSCI_FN_MIGRATE] = PSCI_0_2_FN64_MIGRATE; - psci_ops.migrate = psci_migrate; - - psci_function_id[PSCI_FN_AFFINITY_INFO] = PSCI_0_2_FN64_AFFINITY_INFO; - psci_ops.affinity_info = psci_affinity_info; - - psci_function_id[PSCI_FN_MIGRATE_INFO_TYPE] = - PSCI_0_2_FN_MIGRATE_INFO_TYPE; - psci_ops.migrate_info_type = psci_migrate_info_type; - - arm_pm_restart = psci_sys_reset; - - pm_power_off = psci_sys_poweroff; -} - -/* - * Probe function for PSCI firmware versions >= 0.2 - */ -static int __init psci_probe(void) -{ - int ver = psci_get_version(); - - if (ver == PSCI_RET_NOT_SUPPORTED) { - /* - * PSCI versions >=0.2 mandates implementation of - * PSCI_VERSION. - */ - pr_err("PSCI firmware does not comply with the v0.2 spec.\n"); - return -EOPNOTSUPP; - } else { - pr_info("PSCIv%d.%d detected in firmware.\n", - PSCI_VERSION_MAJOR(ver), - PSCI_VERSION_MINOR(ver)); - - if (PSCI_VERSION_MAJOR(ver) == 0 && - PSCI_VERSION_MINOR(ver) < 2) { - pr_err("Conflicting PSCI version detected.\n"); - return -EINVAL; - } - } - - psci_0_2_set_functions(); - - return 0; -} - -/* - * PSCI init function for PSCI versions >=0.2 - * - * Probe based on PSCI PSCI_VERSION function - */ -static int __init psci_0_2_init(struct device_node *np) -{ - int err; - - err = get_set_conduit_method(np); - - if (err) - goto out_put_node; - /* - * Starting with v0.2, the PSCI specification introduced a call - * (PSCI_VERSION) that allows probing the firmware version, so - * that PSCI function IDs and version specific initialization - * can be carried out according to the specific version reported - * by firmware - */ - err = psci_probe(); - -out_put_node: - of_node_put(np); - return err; -} - -/* - * PSCI < v0.2 get PSCI Function IDs via DT. - */ -static int __init psci_0_1_init(struct device_node *np) -{ - u32 id; - int err; - - err = get_set_conduit_method(np); - - if (err) - goto out_put_node; - - pr_info("Using PSCI v0.1 Function IDs from DT\n"); - - if (!of_property_read_u32(np, "cpu_suspend", &id)) { - psci_function_id[PSCI_FN_CPU_SUSPEND] = id; - psci_ops.cpu_suspend = psci_cpu_suspend; - } - - if (!of_property_read_u32(np, "cpu_off", &id)) { - psci_function_id[PSCI_FN_CPU_OFF] = id; - psci_ops.cpu_off = psci_cpu_off; - } - - if (!of_property_read_u32(np, "cpu_on", &id)) { - psci_function_id[PSCI_FN_CPU_ON] = id; - psci_ops.cpu_on = psci_cpu_on; - } - - if (!of_property_read_u32(np, "migrate", &id)) { - psci_function_id[PSCI_FN_MIGRATE] = id; - psci_ops.migrate = psci_migrate; - } - -out_put_node: - of_node_put(np); - return err; -} - -static const struct of_device_id psci_of_match[] __initconst = { - { .compatible = "arm,psci", .data = psci_0_1_init}, - { .compatible = "arm,psci-0.2", .data = psci_0_2_init}, - {}, -}; - -int __init psci_dt_init(void) -{ - struct device_node *np; - const struct of_device_id *matched_np; - psci_initcall_t init_fn; - - np = of_find_matching_node_and_match(NULL, psci_of_match, &matched_np); - - if (!np) - return -ENODEV; - - init_fn = (psci_initcall_t)matched_np->data; - return init_fn(np); -} - -/* - * We use PSCI 0.2+ when ACPI is deployed on ARM64 and it's - * explicitly clarified in SBBR - */ -int __init psci_acpi_init(void) -{ - if (!acpi_psci_present()) { - pr_info("is not implemented in ACPI.\n"); - return -EOPNOTSUPP; - } - - pr_info("probing for conduit method from ACPI.\n"); - - if (acpi_psci_use_hvc()) - invoke_psci_fn = __invoke_psci_fn_hvc; - else - invoke_psci_fn = __invoke_psci_fn_smc; - - return psci_probe(); -} - -#ifdef CONFIG_SMP - -static int __init cpu_psci_cpu_init(struct device_node *dn, unsigned int cpu) +static int __init cpu_psci_cpu_init(unsigned int cpu) { return 0; } @@ -474,6 +126,11 @@ static int cpu_psci_cpu_disable(unsigned int cpu) /* Fail early if we don't have CPU_OFF support */ if (!psci_ops.cpu_off) return -EOPNOTSUPP; + + /* Trusted OS will deny CPU_OFF */ + if (psci_tos_resident_on(cpu)) + return -EPERM; + return 0; } @@ -484,9 +141,8 @@ static void cpu_psci_cpu_die(unsigned int cpu) * There are no known implementations of PSCI actually using the * power state field, pass a sensible default for now. */ - struct psci_power_state state = { - .type = PSCI_POWER_STATE_TYPE_POWER_DOWN, - }; + u32 state = PSCI_POWER_STATE_TYPE_POWER_DOWN << + PSCI_0_2_POWER_STATE_TYPE_SHIFT; ret = psci_ops.cpu_off(state); @@ -498,7 +154,7 @@ static int cpu_psci_cpu_kill(unsigned int cpu) int err, i; if (!psci_ops.affinity_info) - return 1; + return 0; /* * cpu_kill could race with cpu_die and we can * potentially end up declaring this cpu undead @@ -509,7 +165,7 @@ static int cpu_psci_cpu_kill(unsigned int cpu) err = psci_ops.affinity_info(cpu_logical_map(cpu), 0); if (err == PSCI_0_2_AFFINITY_LEVEL_OFF) { pr_info("CPU%d killed.\n", cpu); - return 1; + return 0; } msleep(10); @@ -518,15 +174,13 @@ static int cpu_psci_cpu_kill(unsigned int cpu) pr_warn("CPU%d may not have shut down cleanly (AFFINITY_INFO reports %d)\n", cpu, err); - /* Make op_cpu_kill() fail. */ - return 0; + return -ETIMEDOUT; } #endif -#endif static int psci_suspend_finisher(unsigned long index) { - struct psci_power_state *state = __this_cpu_read(psci_power_state); + u32 *state = __this_cpu_read(psci_power_state); return psci_ops.cpu_suspend(state[index - 1], virt_to_phys(cpu_resume)); @@ -535,7 +189,7 @@ static int psci_suspend_finisher(unsigned long index) static int __maybe_unused cpu_psci_cpu_suspend(unsigned long index) { int ret; - struct psci_power_state *state = __this_cpu_read(psci_power_state); + u32 *state = __this_cpu_read(psci_power_state); /* * idle state index 0 corresponds to wfi, should never be called * from the cpu_suspend operations @@ -543,7 +197,7 @@ static int __maybe_unused cpu_psci_cpu_suspend(unsigned long index) if (WARN_ON_ONCE(!index)) return -EINVAL; - if (state[index - 1].type == PSCI_POWER_STATE_TYPE_STANDBY) + if (!psci_power_state_loses_context(state[index - 1])) ret = psci_ops.cpu_suspend(state[index - 1], 0); else ret = cpu_suspend(index, psci_suspend_finisher); @@ -557,7 +211,6 @@ const struct cpu_operations cpu_psci_ops = { .cpu_init_idle = cpu_psci_cpu_init_idle, .cpu_suspend = cpu_psci_cpu_suspend, #endif -#ifdef CONFIG_SMP .cpu_init = cpu_psci_cpu_init, .cpu_prepare = cpu_psci_cpu_prepare, .cpu_boot = cpu_psci_cpu_boot, @@ -566,6 +219,5 @@ const struct cpu_operations cpu_psci_ops = { .cpu_die = cpu_psci_cpu_die, .cpu_kill = cpu_psci_cpu_kill, #endif -#endif }; diff --git a/arch/arm64/kernel/setup.c b/arch/arm64/kernel/setup.c index bbdb53b87e13..9576306fddfb 100644 --- a/arch/arm64/kernel/setup.c +++ b/arch/arm64/kernel/setup.c @@ -46,6 +46,7 @@ #include <linux/of_platform.h> #include <linux/efi.h> #include <linux/personality.h> +#include <linux/psci.h> #include <asm/acpi.h> #include <asm/fixmap.h> @@ -61,7 +62,6 @@ #include <asm/tlbflush.h> #include <asm/traps.h> #include <asm/memblock.h> -#include <asm/psci.h> #include <asm/efi.h> #include <asm/virt.h> @@ -142,7 +142,6 @@ bool arch_match_cpu_phys_id(int cpu, u64 phys_id) } struct mpidr_hash mpidr_hash; -#ifdef CONFIG_SMP /** * smp_build_mpidr_hash - Pre-compute shifts required at each affinity * level in order to build a linear index from an @@ -208,7 +207,6 @@ static void __init smp_build_mpidr_hash(void) pr_warn("Large number of MPIDR hash buckets detected\n"); __flush_dcache_area(&mpidr_hash, sizeof(struct mpidr_hash)); } -#endif static void __init hyp_mode_check(void) { @@ -408,18 +406,13 @@ void __init setup_arch(char **cmdline_p) if (acpi_disabled) { unflatten_device_tree(); psci_dt_init(); - cpu_read_bootcpu_ops(); -#ifdef CONFIG_SMP - of_smp_init_cpus(); -#endif } else { psci_acpi_init(); - acpi_init_cpus(); } -#ifdef CONFIG_SMP + cpu_read_bootcpu_ops(); + smp_init_cpus(); smp_build_mpidr_hash(); -#endif #ifdef CONFIG_VT #if defined(CONFIG_VGA_CONSOLE) @@ -519,9 +512,7 @@ static int c_show(struct seq_file *m, void *v) * online processors, looking for lines beginning with * "processor". Give glibc what it expects. */ -#ifdef CONFIG_SMP seq_printf(m, "processor\t: %d\n", i); -#endif seq_printf(m, "BogoMIPS\t: %lu.%02lu\n", loops_per_jiffy / (500000UL/HZ), diff --git a/arch/arm64/kernel/sleep.S b/arch/arm64/kernel/sleep.S index ede186cdd452..5686a3ae3940 100644 --- a/arch/arm64/kernel/sleep.S +++ b/arch/arm64/kernel/sleep.S @@ -82,7 +82,6 @@ ENTRY(__cpu_suspend_enter) str x2, [x0, #CPU_CTX_SP] ldr x1, =sleep_save_sp ldr x1, [x1, #SLEEP_SAVE_SP_VIRT] -#ifdef CONFIG_SMP mrs x7, mpidr_el1 ldr x9, =mpidr_hash ldr x10, [x9, #MPIDR_HASH_MASK] @@ -94,7 +93,6 @@ ENTRY(__cpu_suspend_enter) ldp w5, w6, [x9, #(MPIDR_HASH_SHIFTS + 8)] compute_mpidr_hash x8, x3, x4, x5, x6, x7, x10 add x1, x1, x8, lsl #3 -#endif bl __cpu_suspend_save /* * Grab suspend finisher in x20 and its argument in x19 @@ -130,12 +128,14 @@ ENDPROC(__cpu_suspend_enter) /* * x0 must contain the sctlr value retrieved from restored context */ + .pushsection ".idmap.text", "ax" ENTRY(cpu_resume_mmu) ldr x3, =cpu_resume_after_mmu msr sctlr_el1, x0 // restore sctlr_el1 isb br x3 // global jump to virtual address ENDPROC(cpu_resume_mmu) + .popsection cpu_resume_after_mmu: mov x0, #0 // return zero on success ldp x19, x20, [sp, #16] @@ -149,7 +149,6 @@ ENDPROC(cpu_resume_after_mmu) ENTRY(cpu_resume) bl el2_setup // if in EL2 drop to EL1 cleanly -#ifdef CONFIG_SMP mrs x1, mpidr_el1 adrp x8, mpidr_hash add x8, x8, #:lo12:mpidr_hash // x8 = struct mpidr_hash phys address @@ -159,18 +158,13 @@ ENTRY(cpu_resume) ldp w5, w6, [x8, #(MPIDR_HASH_SHIFTS + 8)] compute_mpidr_hash x7, x3, x4, x5, x6, x1, x2 /* x7 contains hash index, let's use it to grab context pointer */ -#else mov x7, xzr -#endif - adrp x0, sleep_save_sp - add x0, x0, #:lo12:sleep_save_sp - ldr x0, [x0, #SLEEP_SAVE_SP_PHYS] + ldr_l x0, sleep_save_sp + SLEEP_SAVE_SP_PHYS ldr x0, [x0, x7, lsl #3] /* load sp from context */ ldr x2, [x0, #CPU_CTX_SP] - adrp x1, sleep_idmap_phys /* load physical address of identity map page table in x1 */ - ldr x1, [x1, #:lo12:sleep_idmap_phys] + adrp x1, idmap_pg_dir mov sp, x2 /* * cpu_do_resume expects x0 to contain context physical address diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c index d3a202b85ba6..a1e6ed5f0d06 100644 --- a/arch/arm64/kernel/smp.c +++ b/arch/arm64/kernel/smp.c @@ -17,6 +17,7 @@ * along with this program. If not, see <http://www.gnu.org/licenses/>. */ +#include <linux/acpi.h> #include <linux/delay.h> #include <linux/init.h> #include <linux/spinlock.h> @@ -248,7 +249,7 @@ static int op_cpu_kill(unsigned int cpu) * time and hope that it's dead, so let's skip the wait and just hope. */ if (!cpu_ops[cpu]->cpu_kill) - return 1; + return 0; return cpu_ops[cpu]->cpu_kill(cpu); } @@ -261,6 +262,8 @@ static DECLARE_COMPLETION(cpu_died); */ void __cpu_die(unsigned int cpu) { + int err; + if (!wait_for_completion_timeout(&cpu_died, msecs_to_jiffies(5000))) { pr_crit("CPU%u: cpu didn't die\n", cpu); return; @@ -273,8 +276,10 @@ void __cpu_die(unsigned int cpu) * verify that it has really left the kernel before we consider * clobbering anything it might still be using. */ - if (!op_cpu_kill(cpu)) - pr_warn("CPU%d may not have shut down cleanly\n", cpu); + err = op_cpu_kill(cpu); + if (err) + pr_warn("CPU%d may not have shut down cleanly: %d\n", + cpu, err); } /* @@ -318,57 +323,158 @@ void __init smp_prepare_boot_cpu(void) set_my_cpu_offset(per_cpu_offset(smp_processor_id())); } +static u64 __init of_get_cpu_mpidr(struct device_node *dn) +{ + const __be32 *cell; + u64 hwid; + + /* + * A cpu node with missing "reg" property is + * considered invalid to build a cpu_logical_map + * entry. + */ + cell = of_get_property(dn, "reg", NULL); + if (!cell) { + pr_err("%s: missing reg property\n", dn->full_name); + return INVALID_HWID; + } + + hwid = of_read_number(cell, of_n_addr_cells(dn)); + /* + * Non affinity bits must be set to 0 in the DT + */ + if (hwid & ~MPIDR_HWID_BITMASK) { + pr_err("%s: invalid reg property\n", dn->full_name); + return INVALID_HWID; + } + return hwid; +} + +/* + * Duplicate MPIDRs are a recipe for disaster. Scan all initialized + * entries and check for duplicates. If any is found just ignore the + * cpu. cpu_logical_map was initialized to INVALID_HWID to avoid + * matching valid MPIDR values. + */ +static bool __init is_mpidr_duplicate(unsigned int cpu, u64 hwid) +{ + unsigned int i; + + for (i = 1; (i < cpu) && (i < NR_CPUS); i++) + if (cpu_logical_map(i) == hwid) + return true; + return false; +} + +/* + * Initialize cpu operations for a logical cpu and + * set it in the possible mask on success + */ +static int __init smp_cpu_setup(int cpu) +{ + if (cpu_read_ops(cpu)) + return -ENODEV; + + if (cpu_ops[cpu]->cpu_init(cpu)) + return -ENODEV; + + set_cpu_possible(cpu, true); + + return 0; +} + +static bool bootcpu_valid __initdata; +static unsigned int cpu_count = 1; + +#ifdef CONFIG_ACPI +/* + * acpi_map_gic_cpu_interface - parse processor MADT entry + * + * Carry out sanity checks on MADT processor entry and initialize + * cpu_logical_map on success + */ +static void __init +acpi_map_gic_cpu_interface(struct acpi_madt_generic_interrupt *processor) +{ + u64 hwid = processor->arm_mpidr; + + if (hwid & ~MPIDR_HWID_BITMASK || hwid == INVALID_HWID) { + pr_err("skipping CPU entry with invalid MPIDR 0x%llx\n", hwid); + return; + } + + if (!(processor->flags & ACPI_MADT_ENABLED)) { + pr_err("skipping disabled CPU entry with 0x%llx MPIDR\n", hwid); + return; + } + + if (is_mpidr_duplicate(cpu_count, hwid)) { + pr_err("duplicate CPU MPIDR 0x%llx in MADT\n", hwid); + return; + } + + /* Check if GICC structure of boot CPU is available in the MADT */ + if (cpu_logical_map(0) == hwid) { + if (bootcpu_valid) { + pr_err("duplicate boot CPU MPIDR: 0x%llx in MADT\n", + hwid); + return; + } + bootcpu_valid = true; + return; + } + + if (cpu_count >= NR_CPUS) + return; + + /* map the logical cpu id to cpu MPIDR */ + cpu_logical_map(cpu_count) = hwid; + + cpu_count++; +} + +static int __init +acpi_parse_gic_cpu_interface(struct acpi_subtable_header *header, + const unsigned long end) +{ + struct acpi_madt_generic_interrupt *processor; + + processor = (struct acpi_madt_generic_interrupt *)header; + if (BAD_MADT_ENTRY(processor, end)) + return -EINVAL; + + acpi_table_print_madt_entry(header); + + acpi_map_gic_cpu_interface(processor); + + return 0; +} +#else +#define acpi_table_parse_madt(...) do { } while (0) +#endif + /* * Enumerate the possible CPU set from the device tree and build the * cpu logical map array containing MPIDR values related to logical * cpus. Assumes that cpu_logical_map(0) has already been initialized. */ -void __init of_smp_init_cpus(void) +void __init of_parse_and_init_cpus(void) { struct device_node *dn = NULL; - unsigned int i, cpu = 1; - bool bootcpu_valid = false; while ((dn = of_find_node_by_type(dn, "cpu"))) { - const u32 *cell; - u64 hwid; + u64 hwid = of_get_cpu_mpidr(dn); - /* - * A cpu node with missing "reg" property is - * considered invalid to build a cpu_logical_map - * entry. - */ - cell = of_get_property(dn, "reg", NULL); - if (!cell) { - pr_err("%s: missing reg property\n", dn->full_name); + if (hwid == INVALID_HWID) goto next; - } - hwid = of_read_number(cell, of_n_addr_cells(dn)); - /* - * Non affinity bits must be set to 0 in the DT - */ - if (hwid & ~MPIDR_HWID_BITMASK) { - pr_err("%s: invalid reg property\n", dn->full_name); + if (is_mpidr_duplicate(cpu_count, hwid)) { + pr_err("%s: duplicate cpu reg properties in the DT\n", + dn->full_name); goto next; } /* - * Duplicate MPIDRs are a recipe for disaster. Scan - * all initialized entries and check for - * duplicates. If any is found just ignore the cpu. - * cpu_logical_map was initialized to INVALID_HWID to - * avoid matching valid MPIDR values. - */ - for (i = 1; (i < cpu) && (i < NR_CPUS); i++) { - if (cpu_logical_map(i) == hwid) { - pr_err("%s: duplicate cpu reg properties in the DT\n", - dn->full_name); - goto next; - } - } - - /* * The numbering scheme requires that the boot CPU * must be assigned logical id 0. Record it so that * the logical map built from DT is validated and can @@ -392,38 +498,58 @@ void __init of_smp_init_cpus(void) continue; } - if (cpu >= NR_CPUS) - goto next; - - if (cpu_read_ops(dn, cpu) != 0) - goto next; - - if (cpu_ops[cpu]->cpu_init(dn, cpu)) + if (cpu_count >= NR_CPUS) goto next; pr_debug("cpu logical map 0x%llx\n", hwid); - cpu_logical_map(cpu) = hwid; + cpu_logical_map(cpu_count) = hwid; next: - cpu++; + cpu_count++; } +} + +/* + * Enumerate the possible CPU set from the device tree or ACPI and build the + * cpu logical map array containing MPIDR values related to logical + * cpus. Assumes that cpu_logical_map(0) has already been initialized. + */ +void __init smp_init_cpus(void) +{ + int i; - /* sanity check */ - if (cpu > NR_CPUS) - pr_warning("no. of cores (%d) greater than configured maximum of %d - clipping\n", - cpu, NR_CPUS); + if (acpi_disabled) + of_parse_and_init_cpus(); + else + /* + * do a walk of MADT to determine how many CPUs + * we have including disabled CPUs, and get information + * we need for SMP init + */ + acpi_table_parse_madt(ACPI_MADT_TYPE_GENERIC_INTERRUPT, + acpi_parse_gic_cpu_interface, 0); + + if (cpu_count > NR_CPUS) + pr_warn("no. of cores (%d) greater than configured maximum of %d - clipping\n", + cpu_count, NR_CPUS); if (!bootcpu_valid) { - pr_err("DT missing boot CPU MPIDR, not enabling secondaries\n"); + pr_err("missing boot CPU MPIDR, not enabling secondaries\n"); return; } /* - * All the cpus that made it to the cpu_logical_map have been - * validated so set them as possible cpus. + * We need to set the cpu_logical_map entries before enabling + * the cpus so that cpu processor description entries (DT cpu nodes + * and ACPI MADT entries) can be retrieved by matching the cpu hwid + * with entries in cpu_logical_map while initializing the cpus. + * If the cpu set-up fails, invalidate the cpu_logical_map entry. */ - for (i = 0; i < NR_CPUS; i++) - if (cpu_logical_map(i) != INVALID_HWID) - set_cpu_possible(i, true); + for (i = 1; i < NR_CPUS; i++) { + if (cpu_logical_map(i) != INVALID_HWID) { + if (smp_cpu_setup(i)) + cpu_logical_map(i) = INVALID_HWID; + } + } } void __init smp_prepare_cpus(unsigned int max_cpus) diff --git a/arch/arm64/kernel/smp_spin_table.c b/arch/arm64/kernel/smp_spin_table.c index 14944e5b28da..aef3605a8c47 100644 --- a/arch/arm64/kernel/smp_spin_table.c +++ b/arch/arm64/kernel/smp_spin_table.c @@ -49,8 +49,14 @@ static void write_pen_release(u64 val) } -static int smp_spin_table_cpu_init(struct device_node *dn, unsigned int cpu) +static int smp_spin_table_cpu_init(unsigned int cpu) { + struct device_node *dn; + + dn = of_get_cpu_node(cpu, NULL); + if (!dn) + return -ENODEV; + /* * Determine the address from which the CPU is polling. */ diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c index 357418137db7..dd6ad81d53aa 100644 --- a/arch/arm64/kernel/suspend.c +++ b/arch/arm64/kernel/suspend.c @@ -132,7 +132,6 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned long)) } struct sleep_save_sp sleep_save_sp; -phys_addr_t sleep_idmap_phys; static int __init cpu_suspend_init(void) { @@ -146,9 +145,7 @@ static int __init cpu_suspend_init(void) sleep_save_sp.save_ptr_stash = ctx_ptr; sleep_save_sp.save_ptr_stash_phys = virt_to_phys(ctx_ptr); - sleep_idmap_phys = virt_to_phys(idmap_pg_dir); __flush_dcache_area(&sleep_save_sp, sizeof(struct sleep_save_sp)); - __flush_dcache_area(&sleep_idmap_phys, sizeof(sleep_idmap_phys)); return 0; } diff --git a/arch/arm64/kernel/time.c b/arch/arm64/kernel/time.c index 42f9195cf2f8..149151fb42bb 100644 --- a/arch/arm64/kernel/time.c +++ b/arch/arm64/kernel/time.c @@ -42,7 +42,6 @@ #include <asm/thread_info.h> #include <asm/stacktrace.h> -#ifdef CONFIG_SMP unsigned long profile_pc(struct pt_regs *regs) { struct stackframe frame; @@ -62,7 +61,6 @@ unsigned long profile_pc(struct pt_regs *regs) return frame.pc; } EXPORT_SYMBOL(profile_pc); -#endif void __init time_init(void) { diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c index 1ef2940df13c..0a7bc087838e 100644 --- a/arch/arm64/kernel/traps.c +++ b/arch/arm64/kernel/traps.c @@ -179,11 +179,7 @@ void show_stack(struct task_struct *tsk, unsigned long *sp) #else #define S_PREEMPT "" #endif -#ifdef CONFIG_SMP #define S_SMP " SMP" -#else -#define S_SMP "" -#endif static int __die(const char *str, int err, struct thread_info *thread, struct pt_regs *regs) diff --git a/arch/arm64/kernel/vmlinux.lds.S b/arch/arm64/kernel/vmlinux.lds.S index aff07bcad882..4d77757b5894 100644 --- a/arch/arm64/kernel/vmlinux.lds.S +++ b/arch/arm64/kernel/vmlinux.lds.S @@ -38,6 +38,12 @@ jiffies = jiffies_64; *(.hyp.text) \ VMLINUX_SYMBOL(__hyp_text_end) = .; +#define IDMAP_TEXT \ + . = ALIGN(SZ_4K); \ + VMLINUX_SYMBOL(__idmap_text_start) = .; \ + *(.idmap.text) \ + VMLINUX_SYMBOL(__idmap_text_end) = .; + /* * The size of the PE/COFF section that covers the kernel image, which * runs from stext to _edata, must be a round multiple of the PE/COFF @@ -98,6 +104,7 @@ SECTIONS SCHED_TEXT LOCK_TEXT HYPERVISOR_TEXT + IDMAP_TEXT *(.fixup) *(.gnu.warning) . = ALIGN(16); @@ -170,11 +177,13 @@ SECTIONS } /* - * The HYP init code can't be more than a page long, + * The HYP init code and ID map text can't be longer than a page each, * and should not cross a page boundary. */ ASSERT(__hyp_idmap_text_end - (__hyp_idmap_text_start & ~(SZ_4K - 1)) <= SZ_4K, "HYP init code too big or misaligned") +ASSERT(__idmap_text_end - (__idmap_text_start & ~(SZ_4K - 1)) <= SZ_4K, + "ID map text too big or misaligned") /* * If padding is applied before .head.text, virt<->phys conversions will fail. diff --git a/arch/arm64/lib/clear_user.S b/arch/arm64/lib/clear_user.S index c17967fdf5f6..a9723c71c52b 100644 --- a/arch/arm64/lib/clear_user.S +++ b/arch/arm64/lib/clear_user.S @@ -16,7 +16,11 @@ * along with this program. If not, see <http://www.gnu.org/licenses/>. */ #include <linux/linkage.h> + +#include <asm/alternative.h> #include <asm/assembler.h> +#include <asm/cpufeature.h> +#include <asm/sysreg.h> .text @@ -29,6 +33,8 @@ * Alignment fixed up by hardware. */ ENTRY(__clear_user) +ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(0)), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) mov x2, x1 // save the size for fixup return subs x1, x1, #8 b.mi 2f @@ -48,6 +54,8 @@ USER(9f, strh wzr, [x0], #2 ) b.mi 5f USER(9f, strb wzr, [x0] ) 5: mov x0, #0 +ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(1)), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) ret ENDPROC(__clear_user) diff --git a/arch/arm64/lib/copy_from_user.S b/arch/arm64/lib/copy_from_user.S index 5e27add9d362..1be9ef27be97 100644 --- a/arch/arm64/lib/copy_from_user.S +++ b/arch/arm64/lib/copy_from_user.S @@ -15,7 +15,11 @@ */ #include <linux/linkage.h> + +#include <asm/alternative.h> #include <asm/assembler.h> +#include <asm/cpufeature.h> +#include <asm/sysreg.h> /* * Copy from user space to a kernel buffer (alignment handled by the hardware) @@ -28,14 +32,21 @@ * x0 - bytes not copied */ ENTRY(__copy_from_user) - add x4, x1, x2 // upper user buffer boundary - subs x2, x2, #8 +ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(0)), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) + add x5, x1, x2 // upper user buffer boundary + subs x2, x2, #16 + b.mi 1f +0: +USER(9f, ldp x3, x4, [x1], #16) + subs x2, x2, #16 + stp x3, x4, [x0], #16 + b.pl 0b +1: adds x2, x2, #8 b.mi 2f -1: USER(9f, ldr x3, [x1], #8 ) - subs x2, x2, #8 + sub x2, x2, #8 str x3, [x0], #8 - b.pl 1b 2: adds x2, x2, #4 b.mi 3f USER(9f, ldr w3, [x1], #4 ) @@ -51,12 +62,14 @@ USER(9f, ldrh w3, [x1], #2 ) USER(9f, ldrb w3, [x1] ) strb w3, [x0] 5: mov x0, #0 +ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(1)), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) ret ENDPROC(__copy_from_user) .section .fixup,"ax" .align 2 -9: sub x2, x4, x1 +9: sub x2, x5, x1 mov x3, x2 10: strb wzr, [x0], #1 // zero remaining buffer space subs x3, x3, #1 diff --git a/arch/arm64/lib/copy_in_user.S b/arch/arm64/lib/copy_in_user.S index 84b6c9bb9b93..1b94661e22b3 100644 --- a/arch/arm64/lib/copy_in_user.S +++ b/arch/arm64/lib/copy_in_user.S @@ -17,7 +17,11 @@ */ #include <linux/linkage.h> + +#include <asm/alternative.h> #include <asm/assembler.h> +#include <asm/cpufeature.h> +#include <asm/sysreg.h> /* * Copy from user space to user space (alignment handled by the hardware) @@ -30,14 +34,21 @@ * x0 - bytes not copied */ ENTRY(__copy_in_user) - add x4, x0, x2 // upper user buffer boundary - subs x2, x2, #8 +ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(0)), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) + add x5, x0, x2 // upper user buffer boundary + subs x2, x2, #16 + b.mi 1f +0: +USER(9f, ldp x3, x4, [x1], #16) + subs x2, x2, #16 +USER(9f, stp x3, x4, [x0], #16) + b.pl 0b +1: adds x2, x2, #8 b.mi 2f -1: USER(9f, ldr x3, [x1], #8 ) - subs x2, x2, #8 + sub x2, x2, #8 USER(9f, str x3, [x0], #8 ) - b.pl 1b 2: adds x2, x2, #4 b.mi 3f USER(9f, ldr w3, [x1], #4 ) @@ -53,11 +64,13 @@ USER(9f, strh w3, [x0], #2 ) USER(9f, ldrb w3, [x1] ) USER(9f, strb w3, [x0] ) 5: mov x0, #0 +ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(1)), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) ret ENDPROC(__copy_in_user) .section .fixup,"ax" .align 2 -9: sub x0, x4, x0 // bytes not copied +9: sub x0, x5, x0 // bytes not copied ret .previous diff --git a/arch/arm64/lib/copy_to_user.S b/arch/arm64/lib/copy_to_user.S index a0aeeb9b7a28..a257b47e2dc4 100644 --- a/arch/arm64/lib/copy_to_user.S +++ b/arch/arm64/lib/copy_to_user.S @@ -15,7 +15,11 @@ */ #include <linux/linkage.h> + +#include <asm/alternative.h> #include <asm/assembler.h> +#include <asm/cpufeature.h> +#include <asm/sysreg.h> /* * Copy to user space from a kernel buffer (alignment handled by the hardware) @@ -28,14 +32,21 @@ * x0 - bytes not copied */ ENTRY(__copy_to_user) - add x4, x0, x2 // upper user buffer boundary - subs x2, x2, #8 +ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(0)), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) + add x5, x0, x2 // upper user buffer boundary + subs x2, x2, #16 + b.mi 1f +0: + ldp x3, x4, [x1], #16 + subs x2, x2, #16 +USER(9f, stp x3, x4, [x0], #16) + b.pl 0b +1: adds x2, x2, #8 b.mi 2f -1: ldr x3, [x1], #8 - subs x2, x2, #8 + sub x2, x2, #8 USER(9f, str x3, [x0], #8 ) - b.pl 1b 2: adds x2, x2, #4 b.mi 3f ldr w3, [x1], #4 @@ -51,11 +62,13 @@ USER(9f, strh w3, [x0], #2 ) ldrb w3, [x1] USER(9f, strb w3, [x0] ) 5: mov x0, #0 +ALTERNATIVE("nop", __stringify(SET_PSTATE_PAN(1)), ARM64_HAS_PAN, \ + CONFIG_ARM64_PAN) ret ENDPROC(__copy_to_user) .section .fixup,"ax" .align 2 -9: sub x0, x4, x0 // bytes not copied +9: sub x0, x5, x0 // bytes not copied ret .previous diff --git a/arch/arm64/mm/cache.S b/arch/arm64/mm/cache.S index 2560e1e1562e..70a79cb6d504 100644 --- a/arch/arm64/mm/cache.S +++ b/arch/arm64/mm/cache.S @@ -22,7 +22,7 @@ #include <linux/init.h> #include <asm/assembler.h> #include <asm/cpufeature.h> -#include <asm/alternative-asm.h> +#include <asm/alternative.h> #include "proc-macros.S" diff --git a/arch/arm64/mm/context.c b/arch/arm64/mm/context.c index 76c1e6cd36fc..d70ff14dbdbd 100644 --- a/arch/arm64/mm/context.c +++ b/arch/arm64/mm/context.c @@ -53,8 +53,6 @@ static void flush_context(void) __flush_icache_all(); } -#ifdef CONFIG_SMP - static void set_mm_context(struct mm_struct *mm, unsigned int asid) { unsigned long flags; @@ -110,23 +108,12 @@ static void reset_context(void *info) cpu_switch_mm(mm->pgd, mm); } -#else - -static inline void set_mm_context(struct mm_struct *mm, unsigned int asid) -{ - mm->context.id = asid; - cpumask_copy(mm_cpumask(mm), cpumask_of(smp_processor_id())); -} - -#endif - void __new_context(struct mm_struct *mm) { unsigned int asid; unsigned int bits = asid_bits(); raw_spin_lock(&cpu_asid_lock); -#ifdef CONFIG_SMP /* * Check the ASID again, in case the change was broadcast from another * CPU before we acquired the lock. @@ -136,7 +123,6 @@ void __new_context(struct mm_struct *mm) raw_spin_unlock(&cpu_asid_lock); return; } -#endif /* * At this point, it is guaranteed that the current mm (with an old * ASID) isn't active on any other CPU since the ASIDs are changed @@ -155,10 +141,8 @@ void __new_context(struct mm_struct *mm) cpu_last_asid = ASID_FIRST_VERSION; asid = cpu_last_asid + smp_processor_id(); flush_context(); -#ifdef CONFIG_SMP smp_wmb(); smp_call_function(reset_context, NULL, 1); -#endif cpu_last_asid += NR_CPUS - 1; } diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index fa5efaa5c3ac..10a1fc5004dc 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -30,9 +30,11 @@ #include <linux/highmem.h> #include <linux/perf_event.h> +#include <asm/cpufeature.h> #include <asm/exception.h> #include <asm/debug-monitors.h> #include <asm/esr.h> +#include <asm/sysreg.h> #include <asm/system_misc.h> #include <asm/pgtable.h> #include <asm/tlbflush.h> @@ -225,6 +227,13 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr, } /* + * PAN bit set implies the fault happened in kernel space, but not + * in the arch's user access functions. + */ + if (IS_ENABLED(CONFIG_ARM64_PAN) && (regs->pstate & PSR_PAN_BIT)) + goto no_context; + + /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, * we can bug out early if this is from code which shouldn't. @@ -531,3 +540,10 @@ asmlinkage int __exception do_debug_exception(unsigned long addr, return 0; } + +#ifdef CONFIG_ARM64_PAN +void cpu_enable_pan(void) +{ + config_sctlr_el1(SCTLR_EL1_SPAN, 0); +} +#endif /* CONFIG_ARM64_PAN */ diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c index b6f14e8d2121..07844184975b 100644 --- a/arch/arm64/mm/flush.c +++ b/arch/arm64/mm/flush.c @@ -60,14 +60,10 @@ void copy_to_user_page(struct vm_area_struct *vma, struct page *page, unsigned long uaddr, void *dst, const void *src, unsigned long len) { -#ifdef CONFIG_SMP preempt_disable(); -#endif memcpy(dst, src, len); flush_ptrace_access(vma, page, uaddr, dst, len); -#ifdef CONFIG_SMP preempt_enable(); -#endif } void __sync_icache_dcache(pte_t pte, unsigned long addr) diff --git a/arch/arm64/mm/proc.S b/arch/arm64/mm/proc.S index d253908a988d..e1d0e0a292bb 100644 --- a/arch/arm64/mm/proc.S +++ b/arch/arm64/mm/proc.S @@ -34,11 +34,7 @@ #define TCR_TG_FLAGS TCR_TG0_4K | TCR_TG1_4K #endif -#ifdef CONFIG_SMP #define TCR_SMP_FLAGS TCR_SHARED -#else -#define TCR_SMP_FLAGS 0 -#endif /* PTWs cacheable, inner/outer WBWA */ #define TCR_CACHE_FLAGS TCR_IRGN_WBWA | TCR_ORGN_WBWA diff --git a/arch/powerpc/platforms/512x/mpc5121_ads_cpld.c b/arch/powerpc/platforms/512x/mpc5121_ads_cpld.c index ca3a062ed1b9..11090ab4bf59 100644 --- a/arch/powerpc/platforms/512x/mpc5121_ads_cpld.c +++ b/arch/powerpc/platforms/512x/mpc5121_ads_cpld.c @@ -123,7 +123,8 @@ cpld_pic_cascade(unsigned int irq, struct irq_desc *desc) } static int -cpld_pic_host_match(struct irq_domain *h, struct device_node *node) +cpld_pic_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { return cpld_pic_node == node; } diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c index 3af8324c122e..a15f1efc295f 100644 --- a/arch/powerpc/platforms/cell/interrupt.c +++ b/arch/powerpc/platforms/cell/interrupt.c @@ -222,7 +222,8 @@ void iic_request_IPIs(void) #endif /* CONFIG_SMP */ -static int iic_host_match(struct irq_domain *h, struct device_node *node) +static int iic_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { return of_device_is_compatible(node, "IBM,CBEA-Internal-Interrupt-Controller"); diff --git a/arch/powerpc/platforms/embedded6xx/flipper-pic.c b/arch/powerpc/platforms/embedded6xx/flipper-pic.c index 4cde8e7da4b8..b7866e01483d 100644 --- a/arch/powerpc/platforms/embedded6xx/flipper-pic.c +++ b/arch/powerpc/platforms/embedded6xx/flipper-pic.c @@ -108,7 +108,8 @@ static int flipper_pic_map(struct irq_domain *h, unsigned int virq, return 0; } -static int flipper_pic_match(struct irq_domain *h, struct device_node *np) +static int flipper_pic_match(struct irq_domain *h, struct device_node *np, + enum irq_domain_bus_token bus_token) { return 1; } diff --git a/arch/powerpc/platforms/powermac/pic.c b/arch/powerpc/platforms/powermac/pic.c index 59cfc9d63c2d..6f4f8b060def 100644 --- a/arch/powerpc/platforms/powermac/pic.c +++ b/arch/powerpc/platforms/powermac/pic.c @@ -268,7 +268,8 @@ static struct irqaction gatwick_cascade_action = { .name = "cascade", }; -static int pmac_pic_host_match(struct irq_domain *h, struct device_node *node) +static int pmac_pic_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { /* We match all, we don't always have a node anyway */ return 1; diff --git a/arch/powerpc/platforms/powernv/opal-irqchip.c b/arch/powerpc/platforms/powernv/opal-irqchip.c new file mode 100644 index 000000000000..2c91ee7800b9 --- /dev/null +++ b/arch/powerpc/platforms/powernv/opal-irqchip.c @@ -0,0 +1,254 @@ +/* + * This file implements an irqchip for OPAL events. Whenever there is + * an interrupt that is handled by OPAL we get passed a list of events + * that Linux needs to do something about. These basically look like + * interrupts to Linux so we implement an irqchip to handle them. + * + * Copyright Alistair Popple, IBM Corporation 2014. + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; either version 2 of the License, or (at your + * option) any later version. + */ +#include <linux/bitops.h> +#include <linux/irq.h> +#include <linux/irqchip.h> +#include <linux/irqdomain.h> +#include <linux/interrupt.h> +#include <linux/module.h> +#include <linux/of.h> +#include <linux/platform_device.h> +#include <linux/kthread.h> +#include <linux/delay.h> +#include <linux/slab.h> +#include <linux/irq_work.h> + +#include <asm/machdep.h> +#include <asm/opal.h> + +#include "powernv.h" + +/* Maximum number of events supported by OPAL firmware */ +#define MAX_NUM_EVENTS 64 + +struct opal_event_irqchip { + struct irq_chip irqchip; + struct irq_domain *domain; + unsigned long mask; +}; +static struct opal_event_irqchip opal_event_irqchip; + +static unsigned int opal_irq_count; +static unsigned int *opal_irqs; + +static void opal_handle_irq_work(struct irq_work *work); +static __be64 last_outstanding_events; +static struct irq_work opal_event_irq_work = { + .func = opal_handle_irq_work, +}; + +static void opal_event_mask(struct irq_data *d) +{ + clear_bit(d->hwirq, &opal_event_irqchip.mask); +} + +static void opal_event_unmask(struct irq_data *d) +{ + set_bit(d->hwirq, &opal_event_irqchip.mask); + + opal_poll_events(&last_outstanding_events); + if (last_outstanding_events & opal_event_irqchip.mask) + /* Need to retrigger the interrupt */ + irq_work_queue(&opal_event_irq_work); +} + +static int opal_event_set_type(struct irq_data *d, unsigned int flow_type) +{ + /* + * For now we only support level triggered events. The irq + * handler will be called continuously until the event has + * been cleared in OPAL. + */ + if (flow_type != IRQ_TYPE_LEVEL_HIGH) + return -EINVAL; + + return 0; +} + +static struct opal_event_irqchip opal_event_irqchip = { + .irqchip = { + .name = "OPAL EVT", + .irq_mask = opal_event_mask, + .irq_unmask = opal_event_unmask, + .irq_set_type = opal_event_set_type, + }, + .mask = 0, +}; + +static int opal_event_map(struct irq_domain *d, unsigned int irq, + irq_hw_number_t hwirq) +{ + irq_set_chip_data(irq, &opal_event_irqchip); + irq_set_chip_and_handler(irq, &opal_event_irqchip.irqchip, + handle_level_irq); + + return 0; +} + +void opal_handle_events(uint64_t events) +{ + int virq, hwirq = 0; + u64 mask = opal_event_irqchip.mask; + + if (!in_irq() && (events & mask)) { + last_outstanding_events = events; + irq_work_queue(&opal_event_irq_work); + return; + } + + while (events & mask) { + hwirq = fls64(events) - 1; + if (BIT_ULL(hwirq) & mask) { + virq = irq_find_mapping(opal_event_irqchip.domain, + hwirq); + if (virq) + generic_handle_irq(virq); + } + events &= ~BIT_ULL(hwirq); + } +} + +static irqreturn_t opal_interrupt(int irq, void *data) +{ + __be64 events; + + opal_handle_interrupt(virq_to_hw(irq), &events); + opal_handle_events(be64_to_cpu(events)); + + return IRQ_HANDLED; +} + +static void opal_handle_irq_work(struct irq_work *work) +{ + opal_handle_events(be64_to_cpu(last_outstanding_events)); +} + +static int opal_event_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) +{ + return h->of_node == node; +} + +static int opal_event_xlate(struct irq_domain *h, struct device_node *np, + const u32 *intspec, unsigned int intsize, + irq_hw_number_t *out_hwirq, unsigned int *out_flags) +{ + *out_hwirq = intspec[0]; + *out_flags = IRQ_TYPE_LEVEL_HIGH; + + return 0; +} + +static const struct irq_domain_ops opal_event_domain_ops = { + .match = opal_event_match, + .map = opal_event_map, + .xlate = opal_event_xlate, +}; + +void opal_event_shutdown(void) +{ + unsigned int i; + + /* First free interrupts, which will also mask them */ + for (i = 0; i < opal_irq_count; i++) { + if (opal_irqs[i]) + free_irq(opal_irqs[i], NULL); + opal_irqs[i] = 0; + } +} + +int __init opal_event_init(void) +{ + struct device_node *dn, *opal_node; + const __be32 *irqs; + int i, irqlen, rc = 0; + + opal_node = of_find_node_by_path("/ibm,opal"); + if (!opal_node) { + pr_warn("opal: Node not found\n"); + return -ENODEV; + } + + /* If dn is NULL it means the domain won't be linked to a DT + * node so therefore irq_of_parse_and_map(...) wont work. But + * that shouldn't be problem because if we're running a + * version of skiboot that doesn't have the dn then the + * devices won't have the correct properties and will have to + * fall back to the legacy method (opal_event_request(...)) + * anyway. */ + dn = of_find_compatible_node(NULL, NULL, "ibm,opal-event"); + opal_event_irqchip.domain = irq_domain_add_linear(dn, MAX_NUM_EVENTS, + &opal_event_domain_ops, &opal_event_irqchip); + of_node_put(dn); + if (!opal_event_irqchip.domain) { + pr_warn("opal: Unable to create irq domain\n"); + rc = -ENOMEM; + goto out; + } + + /* Get interrupt property */ + irqs = of_get_property(opal_node, "opal-interrupts", &irqlen); + opal_irq_count = irqs ? (irqlen / 4) : 0; + pr_debug("Found %d interrupts reserved for OPAL\n", opal_irq_count); + + /* Install interrupt handlers */ + opal_irqs = kcalloc(opal_irq_count, sizeof(*opal_irqs), GFP_KERNEL); + for (i = 0; irqs && i < opal_irq_count; i++, irqs++) { + unsigned int irq, virq; + + /* Get hardware and virtual IRQ */ + irq = be32_to_cpup(irqs); + virq = irq_create_mapping(NULL, irq); + if (virq == NO_IRQ) { + pr_warn("Failed to map irq 0x%x\n", irq); + continue; + } + + /* Install interrupt handler */ + rc = request_irq(virq, opal_interrupt, 0, "opal", NULL); + if (rc) { + irq_dispose_mapping(virq); + pr_warn("Error %d requesting irq %d (0x%x)\n", + rc, virq, irq); + continue; + } + + /* Cache IRQ */ + opal_irqs[i] = virq; + } + +out: + of_node_put(opal_node); + return rc; +} +machine_arch_initcall(powernv, opal_event_init); + +/** + * opal_event_request(unsigned int opal_event_nr) - Request an event + * @opal_event_nr: the opal event number to request + * + * This routine can be used to find the linux virq number which can + * then be passed to request_irq to assign a handler for a particular + * opal event. This should only be used by legacy devices which don't + * have proper device tree bindings. Most devices should use + * irq_of_parse_and_map() instead. + */ +int opal_event_request(unsigned int opal_event_nr) +{ + if (WARN_ON_ONCE(!opal_event_irqchip.domain)) + return NO_IRQ; + + return irq_create_mapping(opal_event_irqchip.domain, opal_event_nr); +} +EXPORT_SYMBOL(opal_event_request); diff --git a/arch/powerpc/platforms/ps3/interrupt.c b/arch/powerpc/platforms/ps3/interrupt.c index a6c42f34303a..638c4060938e 100644 --- a/arch/powerpc/platforms/ps3/interrupt.c +++ b/arch/powerpc/platforms/ps3/interrupt.c @@ -678,7 +678,8 @@ static int ps3_host_map(struct irq_domain *h, unsigned int virq, return 0; } -static int ps3_host_match(struct irq_domain *h, struct device_node *np) +static int ps3_host_match(struct irq_domain *h, struct device_node *np, + enum irq_domain_bus_token bus_token) { /* Match all */ return 1; diff --git a/arch/powerpc/sysdev/ehv_pic.c b/arch/powerpc/sysdev/ehv_pic.c index 2d20f10a4203..eca0b00794fa 100644 --- a/arch/powerpc/sysdev/ehv_pic.c +++ b/arch/powerpc/sysdev/ehv_pic.c @@ -177,7 +177,8 @@ unsigned int ehv_pic_get_irq(void) return irq_linear_revmap(global_ehv_pic->irqhost, irq); } -static int ehv_pic_host_match(struct irq_domain *h, struct device_node *node) +static int ehv_pic_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { /* Exact match, unless ehv_pic node is NULL */ return h->of_node == NULL || h->of_node == node; diff --git a/arch/powerpc/sysdev/i8259.c b/arch/powerpc/sysdev/i8259.c index 45598da0b321..8c3756cbc4f9 100644 --- a/arch/powerpc/sysdev/i8259.c +++ b/arch/powerpc/sysdev/i8259.c @@ -162,7 +162,8 @@ static struct resource pic_edgectrl_iores = { .flags = IORESOURCE_BUSY, }; -static int i8259_host_match(struct irq_domain *h, struct device_node *node) +static int i8259_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { return h->of_node == NULL || h->of_node == node; } diff --git a/arch/powerpc/sysdev/ipic.c b/arch/powerpc/sysdev/ipic.c index b28733727ed3..27d317a4856d 100644 --- a/arch/powerpc/sysdev/ipic.c +++ b/arch/powerpc/sysdev/ipic.c @@ -671,7 +671,8 @@ static struct irq_chip ipic_edge_irq_chip = { .irq_set_type = ipic_set_irq_type, }; -static int ipic_host_match(struct irq_domain *h, struct device_node *node) +static int ipic_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { /* Exact match, unless ipic node is NULL */ return h->of_node == NULL || h->of_node == node; diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c index b2b8447a227a..8b6e282479b8 100644 --- a/arch/powerpc/sysdev/mpic.c +++ b/arch/powerpc/sysdev/mpic.c @@ -1007,7 +1007,8 @@ static struct irq_chip mpic_irq_ht_chip = { #endif /* CONFIG_MPIC_U3_HT_IRQS */ -static int mpic_host_match(struct irq_domain *h, struct device_node *node) +static int mpic_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { /* Exact match, unless mpic node is NULL */ return h->of_node == NULL || h->of_node == node; diff --git a/arch/powerpc/sysdev/qe_lib/qe_ic.c b/arch/powerpc/sysdev/qe_lib/qe_ic.c index 543765e1ef14..da00c9e761e1 100644 --- a/arch/powerpc/sysdev/qe_lib/qe_ic.c +++ b/arch/powerpc/sysdev/qe_lib/qe_ic.c @@ -244,7 +244,8 @@ static struct irq_chip qe_ic_irq_chip = { .irq_mask_ack = qe_ic_mask_irq, }; -static int qe_ic_host_match(struct irq_domain *h, struct device_node *node) +static int qe_ic_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { /* Exact match, unless qe_ic node is NULL */ return h->of_node == NULL || h->of_node == node; diff --git a/arch/powerpc/sysdev/xics/xics-common.c b/arch/powerpc/sysdev/xics/xics-common.c index 878a54036a25..41e334db8847 100644 --- a/arch/powerpc/sysdev/xics/xics-common.c +++ b/arch/powerpc/sysdev/xics/xics-common.c @@ -298,7 +298,8 @@ int xics_get_irq_server(unsigned int virq, const struct cpumask *cpumask, } #endif /* CONFIG_SMP */ -static int xics_host_match(struct irq_domain *h, struct device_node *node) +static int xics_host_match(struct irq_domain *h, struct device_node *node, + enum irq_domain_bus_token bus_token) { struct ics *ics; diff --git a/block/bio.c b/block/bio.c index cbce3e2208f4..167e5ec31717 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1983,6 +1983,29 @@ struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int front_ EXPORT_SYMBOL(bioset_create_nobvec); #ifdef CONFIG_BLK_CGROUP + +/** + * bio_associate_blkcg - associate a bio with the specified blkcg + * @bio: target bio + * @blkcg_css: css of the blkcg to associate + * + * Associate @bio with the blkcg specified by @blkcg_css. Block layer will + * treat @bio as if it were issued by a task which belongs to the blkcg. + * + * This function takes an extra reference of @blkcg_css which will be put + * when @bio is released. The caller must own @bio and is responsible for + * synchronizing calls to this function. + */ +int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css) +{ + if (unlikely(bio->bi_css)) + return -EBUSY; + css_get(blkcg_css); + bio->bi_css = blkcg_css; + return 0; +} +EXPORT_SYMBOL_GPL(bio_associate_blkcg); + /** * bio_associate_current - associate a bio with %current * @bio: target bio @@ -1999,28 +2022,20 @@ EXPORT_SYMBOL(bioset_create_nobvec); int bio_associate_current(struct bio *bio) { struct io_context *ioc; - struct cgroup_subsys_state *css; - if (bio->bi_ioc) + if (bio->bi_css) return -EBUSY; ioc = current->io_context; if (!ioc) return -ENOENT; - /* acquire active ref on @ioc and associate */ get_io_context_active(ioc); bio->bi_ioc = ioc; - - /* associate blkcg if exists */ - rcu_read_lock(); - css = task_css(current, blkio_cgrp_id); - if (css && css_tryget_online(css)) - bio->bi_css = css; - rcu_read_unlock(); - + bio->bi_css = task_get_css(current, blkio_cgrp_id); return 0; } +EXPORT_SYMBOL_GPL(bio_associate_current); /** * bio_disassociate_task - undo bio_associate_current() diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 6817e28960b7..7040eddc4bbf 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -15,11 +15,12 @@ #include <linux/module.h> #include <linux/err.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/slab.h> #include <linux/genhd.h> #include <linux/delay.h> #include <linux/atomic.h> -#include "blk-cgroup.h" +#include <linux/blk-cgroup.h> #include "blk.h" #define MAX_KEY_LEN 100 @@ -30,6 +31,8 @@ struct blkcg blkcg_root = { .cfq_weight = 2 * CFQ_WEIGHT_DEFAULT, .cfq_leaf_weight = 2 * CFQ_WEIGHT_DEFAULT, }; EXPORT_SYMBOL_GPL(blkcg_root); +struct cgroup_subsys_state * const blkcg_root_css = &blkcg_root.css; + static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS]; static bool blkcg_policy_enabled(struct request_queue *q, @@ -179,6 +182,7 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, struct blkcg_gq *new_blkg) { struct blkcg_gq *blkg; + struct bdi_writeback_congested *wb_congested; int i, ret; WARN_ON_ONCE(!rcu_read_lock_held()); @@ -190,22 +194,30 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, goto err_free_blkg; } + wb_congested = wb_congested_get_create(&q->backing_dev_info, + blkcg->css.id, GFP_ATOMIC); + if (!wb_congested) { + ret = -ENOMEM; + goto err_put_css; + } + /* allocate */ if (!new_blkg) { new_blkg = blkg_alloc(blkcg, q, GFP_ATOMIC); if (unlikely(!new_blkg)) { ret = -ENOMEM; - goto err_put_css; + goto err_put_congested; } } blkg = new_blkg; + blkg->wb_congested = wb_congested; /* link parent */ if (blkcg_parent(blkcg)) { blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false); if (WARN_ON_ONCE(!blkg->parent)) { ret = -EINVAL; - goto err_put_css; + goto err_put_congested; } blkg_get(blkg->parent); } @@ -235,18 +247,15 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, blkg->online = true; spin_unlock(&blkcg->lock); - if (!ret) { - if (blkcg == &blkcg_root) { - q->root_blkg = blkg; - q->root_rl.blkg = blkg; - } + if (!ret) return blkg; - } /* @blkg failed fully initialized, use the usual release path */ blkg_put(blkg); return ERR_PTR(ret); +err_put_congested: + wb_congested_put(wb_congested); err_put_css: css_put(&blkcg->css); err_free_blkg: @@ -340,15 +349,6 @@ static void blkg_destroy(struct blkcg_gq *blkg) rcu_assign_pointer(blkcg->blkg_hint, NULL); /* - * If root blkg is destroyed. Just clear the pointer since root_rl - * does not take reference on root blkg. - */ - if (blkcg == &blkcg_root) { - blkg->q->root_blkg = NULL; - blkg->q->root_rl.blkg = NULL; - } - - /* * Put the reference taken at the time of creation so that when all * queues are gone, group can be destroyed. */ @@ -402,6 +402,8 @@ void __blkg_release_rcu(struct rcu_head *rcu_head) if (blkg->parent) blkg_put(blkg->parent); + wb_congested_put(blkg->wb_congested); + blkg_free(blkg); } EXPORT_SYMBOL_GPL(__blkg_release_rcu); @@ -813,6 +815,8 @@ static void blkcg_css_offline(struct cgroup_subsys_state *css) } spin_unlock_irq(&blkcg->lock); + + wb_blkcg_offline(blkcg); } static void blkcg_css_free(struct cgroup_subsys_state *css) @@ -843,7 +847,9 @@ done: spin_lock_init(&blkcg->lock); INIT_RADIX_TREE(&blkcg->blkg_tree, GFP_ATOMIC); INIT_HLIST_HEAD(&blkcg->blkg_list); - +#ifdef CONFIG_CGROUP_WRITEBACK + INIT_LIST_HEAD(&blkcg->cgwb_list); +#endif return &blkcg->css; } @@ -859,9 +865,45 @@ done: */ int blkcg_init_queue(struct request_queue *q) { - might_sleep(); + struct blkcg_gq *new_blkg, *blkg; + bool preloaded; + int ret; + + new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL); + if (!new_blkg) + return -ENOMEM; + + preloaded = !radix_tree_preload(GFP_KERNEL); + + /* + * Make sure the root blkg exists and count the existing blkgs. As + * @q is bypassing at this point, blkg_lookup_create() can't be + * used. Open code insertion. + */ + rcu_read_lock(); + spin_lock_irq(q->queue_lock); + blkg = blkg_create(&blkcg_root, q, new_blkg); + spin_unlock_irq(q->queue_lock); + rcu_read_unlock(); + + if (preloaded) + radix_tree_preload_end(); + + if (IS_ERR(blkg)) { + kfree(new_blkg); + return PTR_ERR(blkg); + } - return blk_throtl_init(q); + q->root_blkg = blkg; + q->root_rl.blkg = blkg; + + ret = blk_throtl_init(q); + if (ret) { + spin_lock_irq(q->queue_lock); + blkg_destroy_all(q); + spin_unlock_irq(q->queue_lock); + } + return ret; } /** @@ -962,52 +1004,20 @@ int blkcg_activate_policy(struct request_queue *q, const struct blkcg_policy *pol) { LIST_HEAD(pds); - struct blkcg_gq *blkg, *new_blkg; + struct blkcg_gq *blkg; struct blkg_policy_data *pd, *n; int cnt = 0, ret; - bool preloaded; if (blkcg_policy_enabled(q, pol)) return 0; - /* preallocations for root blkg */ - new_blkg = blkg_alloc(&blkcg_root, q, GFP_KERNEL); - if (!new_blkg) - return -ENOMEM; - + /* count and allocate policy_data for all existing blkgs */ blk_queue_bypass_start(q); - - preloaded = !radix_tree_preload(GFP_KERNEL); - - /* - * Make sure the root blkg exists and count the existing blkgs. As - * @q is bypassing at this point, blkg_lookup_create() can't be - * used. Open code it. - */ spin_lock_irq(q->queue_lock); - - rcu_read_lock(); - blkg = __blkg_lookup(&blkcg_root, q, false); - if (blkg) - blkg_free(new_blkg); - else - blkg = blkg_create(&blkcg_root, q, new_blkg); - rcu_read_unlock(); - - if (preloaded) - radix_tree_preload_end(); - - if (IS_ERR(blkg)) { - ret = PTR_ERR(blkg); - goto out_unlock; - } - list_for_each_entry(blkg, &q->blkg_list, q_node) cnt++; - spin_unlock_irq(q->queue_lock); - /* allocate policy_data for all existing blkgs */ while (cnt--) { pd = kzalloc_node(pol->pd_size, GFP_KERNEL, q->node); if (!pd) { @@ -1076,10 +1086,6 @@ void blkcg_deactivate_policy(struct request_queue *q, __clear_bit(pol->plid, q->blkcg_pols); - /* if no policy is left, no need for blkgs - shoot them down */ - if (bitmap_empty(q->blkcg_pols, BLKCG_MAX_POLS)) - blkg_destroy_all(q); - list_for_each_entry(blkg, &q->blkg_list, q_node) { /* grab blkcg lock too while removing @pd from @blkg */ spin_lock(&blkg->blkcg->lock); diff --git a/block/blk-core.c b/block/blk-core.c index 03b5f8d77f37..3c515a4cd960 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -32,12 +32,12 @@ #include <linux/delay.h> #include <linux/ratelimit.h> #include <linux/pm_runtime.h> +#include <linux/blk-cgroup.h> #define CREATE_TRACE_POINTS #include <trace/events/block.h> #include "blk.h" -#include "blk-cgroup.h" #include "blk-mq.h" EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap); @@ -63,6 +63,31 @@ struct kmem_cache *blk_requestq_cachep; */ static struct workqueue_struct *kblockd_workqueue; +static void blk_clear_congested(struct request_list *rl, int sync) +{ +#ifdef CONFIG_CGROUP_WRITEBACK + clear_wb_congested(rl->blkg->wb_congested, sync); +#else + /* + * If !CGROUP_WRITEBACK, all blkg's map to bdi->wb and we shouldn't + * flip its congestion state for events on other blkcgs. + */ + if (rl == &rl->q->root_rl) + clear_wb_congested(rl->q->backing_dev_info.wb.congested, sync); +#endif +} + +static void blk_set_congested(struct request_list *rl, int sync) +{ +#ifdef CONFIG_CGROUP_WRITEBACK + set_wb_congested(rl->blkg->wb_congested, sync); +#else + /* see blk_clear_congested() */ + if (rl == &rl->q->root_rl) + set_wb_congested(rl->q->backing_dev_info.wb.congested, sync); +#endif +} + void blk_queue_congestion_threshold(struct request_queue *q) { int nr; @@ -621,8 +646,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id) q->backing_dev_info.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE; - q->backing_dev_info.state = 0; - q->backing_dev_info.capabilities = 0; + q->backing_dev_info.capabilities = BDI_CAP_CGROUP_WRITEBACK; q->backing_dev_info.name = "block"; q->node = node_id; @@ -845,13 +869,8 @@ static void __freed_request(struct request_list *rl, int sync) { struct request_queue *q = rl->q; - /* - * bdi isn't aware of blkcg yet. As all async IOs end up root - * blkcg anyway, just use root blkcg state. - */ - if (rl == &q->root_rl && - rl->count[sync] < queue_congestion_off_threshold(q)) - blk_clear_queue_congested(q, sync); + if (rl->count[sync] < queue_congestion_off_threshold(q)) + blk_clear_congested(rl, sync); if (rl->count[sync] + 1 <= q->nr_requests) { if (waitqueue_active(&rl->wait[sync])) @@ -884,25 +903,25 @@ static void freed_request(struct request_list *rl, unsigned int flags) int blk_update_nr_requests(struct request_queue *q, unsigned int nr) { struct request_list *rl; + int on_thresh, off_thresh; spin_lock_irq(q->queue_lock); q->nr_requests = nr; blk_queue_congestion_threshold(q); + on_thresh = queue_congestion_on_threshold(q); + off_thresh = queue_congestion_off_threshold(q); - /* congestion isn't cgroup aware and follows root blkcg for now */ - rl = &q->root_rl; - - if (rl->count[BLK_RW_SYNC] >= queue_congestion_on_threshold(q)) - blk_set_queue_congested(q, BLK_RW_SYNC); - else if (rl->count[BLK_RW_SYNC] < queue_congestion_off_threshold(q)) - blk_clear_queue_congested(q, BLK_RW_SYNC); + blk_queue_for_each_rl(rl, q) { + if (rl->count[BLK_RW_SYNC] >= on_thresh) + blk_set_congested(rl, BLK_RW_SYNC); + else if (rl->count[BLK_RW_SYNC] < off_thresh) + blk_clear_congested(rl, BLK_RW_SYNC); - if (rl->count[BLK_RW_ASYNC] >= queue_congestion_on_threshold(q)) - blk_set_queue_congested(q, BLK_RW_ASYNC); - else if (rl->count[BLK_RW_ASYNC] < queue_congestion_off_threshold(q)) - blk_clear_queue_congested(q, BLK_RW_ASYNC); + if (rl->count[BLK_RW_ASYNC] >= on_thresh) + blk_set_congested(rl, BLK_RW_ASYNC); + else if (rl->count[BLK_RW_ASYNC] < off_thresh) + blk_clear_congested(rl, BLK_RW_ASYNC); - blk_queue_for_each_rl(rl, q) { if (rl->count[BLK_RW_SYNC] >= q->nr_requests) { blk_set_rl_full(rl, BLK_RW_SYNC); } else { @@ -1012,12 +1031,7 @@ static struct request *__get_request(struct request_list *rl, int rw_flags, } } } - /* - * bdi isn't aware of blkcg yet. As all async IOs end up - * root blkcg anyway, just use root blkcg state. - */ - if (rl == &q->root_rl) - blk_set_queue_congested(q, is_sync); + blk_set_congested(rl, is_sync); } /* diff --git a/block/blk-integrity.c b/block/blk-integrity.c index 79ffb4855af0..f548b64be092 100644 --- a/block/blk-integrity.c +++ b/block/blk-integrity.c @@ -21,6 +21,7 @@ */ #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/mempool.h> #include <linux/bio.h> #include <linux/scatterlist.h> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index 2b8fd302f677..6264b382d4d1 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -6,11 +6,12 @@ #include <linux/module.h> #include <linux/bio.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/blktrace_api.h> #include <linux/blk-mq.h> +#include <linux/blk-cgroup.h> #include "blk.h" -#include "blk-cgroup.h" #include "blk-mq.h" struct queue_sysfs_entry { diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 5b9c6d5c3636..b23193518ac7 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -9,7 +9,7 @@ #include <linux/blkdev.h> #include <linux/bio.h> #include <linux/blktrace_api.h> -#include "blk-cgroup.h" +#include <linux/blk-cgroup.h> #include "blk.h" /* Max dispatch from a group in 1 round */ diff --git a/block/bounce.c b/block/bounce.c index ed9dd8067120..b17311227c12 100644 --- a/block/bounce.c +++ b/block/bounce.c @@ -13,6 +13,7 @@ #include <linux/pagemap.h> #include <linux/mempool.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/init.h> #include <linux/hash.h> #include <linux/highmem.h> @@ -128,9 +129,6 @@ static void bounce_end_io(struct bio *bio, mempool_t *pool, int err) struct bio_vec *bvec, *org_vec; int i; - if (test_bit(BIO_EOPNOTSUPP, &bio->bi_flags)) - set_bit(BIO_EOPNOTSUPP, &bio_orig->bi_flags); - /* * free up bounce indirect pages used */ diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c index 5da8e6e9ab4b..bc8f42930773 100644 --- a/block/cfq-iosched.c +++ b/block/cfq-iosched.c @@ -14,8 +14,8 @@ #include <linux/rbtree.h> #include <linux/ioprio.h> #include <linux/blktrace_api.h> +#include <linux/blk-cgroup.h> #include "blk.h" -#include "blk-cgroup.h" /* * tunables diff --git a/block/elevator.c b/block/elevator.c index 8985038f398c..e321bd55ddfe 100644 --- a/block/elevator.c +++ b/block/elevator.c @@ -35,11 +35,11 @@ #include <linux/hash.h> #include <linux/uaccess.h> #include <linux/pm_runtime.h> +#include <linux/blk-cgroup.h> #include <trace/events/block.h> #include "blk.h" -#include "blk-cgroup.h" static DEFINE_SPINLOCK(elv_list_lock); static LIST_HEAD(elv_list); diff --git a/block/genhd.c b/block/genhd.c index ea982eadaf63..59a1395eedac 100644 --- a/block/genhd.c +++ b/block/genhd.c @@ -8,6 +8,7 @@ #include <linux/kdev_t.h> #include <linux/kernel.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/init.h> #include <linux/spinlock.h> #include <linux/proc_fs.h> diff --git a/drivers/acpi/thermal.c b/drivers/acpi/thermal.c index d24fa1964eb8..6d4e44ea74ac 100644 --- a/drivers/acpi/thermal.c +++ b/drivers/acpi/thermal.c @@ -800,7 +800,8 @@ static int acpi_thermal_cooling_device_cb(struct thermal_zone_device *thermal, result = thermal_zone_bind_cooling_device (thermal, trip, cdev, - THERMAL_NO_LIMIT, THERMAL_NO_LIMIT); + THERMAL_NO_LIMIT, THERMAL_NO_LIMIT, + THERMAL_WEIGHT_DEFAULT); else result = thermal_zone_unbind_cooling_device @@ -824,7 +825,8 @@ static int acpi_thermal_cooling_device_cb(struct thermal_zone_device *thermal, if (bind) result = thermal_zone_bind_cooling_device (thermal, trip, cdev, - THERMAL_NO_LIMIT, THERMAL_NO_LIMIT); + THERMAL_NO_LIMIT, THERMAL_NO_LIMIT, + THERMAL_WEIGHT_DEFAULT); else result = thermal_zone_unbind_cooling_device (thermal, trip, cdev); @@ -841,7 +843,8 @@ static int acpi_thermal_cooling_device_cb(struct thermal_zone_device *thermal, result = thermal_zone_bind_cooling_device (thermal, THERMAL_TRIPS_NONE, cdev, THERMAL_NO_LIMIT, - THERMAL_NO_LIMIT); + THERMAL_NO_LIMIT, + THERMAL_WEIGHT_DEFAULT); else result = thermal_zone_unbind_cooling_device (thermal, THERMAL_TRIPS_NONE, diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h index b905e9888b88..efd19c2da9c2 100644 --- a/drivers/block/drbd/drbd_int.h +++ b/drivers/block/drbd/drbd_int.h @@ -38,6 +38,7 @@ #include <linux/mutex.h> #include <linux/major.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/genhd.h> #include <linux/idr.h> #include <net/tcp.h> diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c index 81fde9ef7f8e..a1518539b858 100644 --- a/drivers/block/drbd/drbd_main.c +++ b/drivers/block/drbd/drbd_main.c @@ -2359,7 +2359,7 @@ static void drbd_cleanup(void) * @congested_data: User data * @bdi_bits: Bits the BDI flusher thread is currently interested in * - * Returns 1<<BDI_async_congested and/or 1<<BDI_sync_congested if we are congested. + * Returns 1<<WB_async_congested and/or 1<<WB_sync_congested if we are congested. */ static int drbd_congested(void *congested_data, int bdi_bits) { @@ -2376,14 +2376,14 @@ static int drbd_congested(void *congested_data, int bdi_bits) } if (test_bit(CALLBACK_PENDING, &first_peer_device(device)->connection->flags)) { - r |= (1 << BDI_async_congested); + r |= (1 << WB_async_congested); /* Without good local data, we would need to read from remote, * and that would need the worker thread as well, which is * currently blocked waiting for that usermode helper to * finish. */ if (!get_ldev_if_state(device, D_UP_TO_DATE)) - r |= (1 << BDI_sync_congested); + r |= (1 << WB_sync_congested); else put_ldev(device); r &= bdi_bits; @@ -2399,9 +2399,9 @@ static int drbd_congested(void *congested_data, int bdi_bits) reason = 'b'; } - if (bdi_bits & (1 << BDI_async_congested) && + if (bdi_bits & (1 << WB_async_congested) && test_bit(NET_CONGESTED, &first_peer_device(device)->connection->flags)) { - r |= (1 << BDI_async_congested); + r |= (1 << WB_async_congested); reason = reason == 'b' ? 'a' : 'n'; } diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c index 09e628dafd9d..4c20c228184c 100644 --- a/drivers/block/pktcdvd.c +++ b/drivers/block/pktcdvd.c @@ -61,6 +61,7 @@ #include <linux/freezer.h> #include <linux/mutex.h> #include <linux/slab.h> +#include <linux/backing-dev.h> #include <scsi/scsi_cmnd.h> #include <scsi/scsi_ioctl.h> #include <scsi/scsi.h> diff --git a/drivers/char/raw.c b/drivers/char/raw.c index 5fc291c6157e..60316fbaf295 100644 --- a/drivers/char/raw.c +++ b/drivers/char/raw.c @@ -12,6 +12,7 @@ #include <linux/fs.h> #include <linux/major.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/module.h> #include <linux/raw.h> #include <linux/capability.h> diff --git a/drivers/cpuidle/cpuidle-calxeda.c b/drivers/cpuidle/cpuidle-calxeda.c index 9445e6cc02be..1c3242c3e3a7 100644 --- a/drivers/cpuidle/cpuidle-calxeda.c +++ b/drivers/cpuidle/cpuidle-calxeda.c @@ -25,16 +25,21 @@ #include <linux/init.h> #include <linux/mm.h> #include <linux/platform_device.h> +#include <linux/psci.h> + #include <asm/cpuidle.h> #include <asm/suspend.h> -#include <asm/psci.h> + +#include <uapi/linux/psci.h> + +#define CALXEDA_IDLE_PARAM \ + ((0 << PSCI_0_2_POWER_STATE_ID_SHIFT) | \ + (0 << PSCI_0_2_POWER_STATE_AFFL_SHIFT) | \ + (PSCI_POWER_STATE_TYPE_POWER_DOWN << PSCI_0_2_POWER_STATE_TYPE_SHIFT)) static int calxeda_idle_finish(unsigned long val) { - const struct psci_power_state ps = { - .type = PSCI_POWER_STATE_TYPE_POWER_DOWN, - }; - return psci_ops.cpu_suspend(ps, __pa(cpu_resume)); + return psci_ops.cpu_suspend(CALXEDA_IDLE_PARAM, __pa(cpu_resume)); } static int calxeda_pwrdown_idle(struct cpuidle_device *dev, diff --git a/drivers/firmware/Kconfig b/drivers/firmware/Kconfig index 6517132e5d8b..24be7a34b84a 100644 --- a/drivers/firmware/Kconfig +++ b/drivers/firmware/Kconfig @@ -5,6 +5,9 @@ menu "Firmware Drivers" +config ARM_PSCI_FW + bool + config EDD tristate "BIOS Enhanced Disk Drive calls determine boot disk" depends on X86 diff --git a/drivers/firmware/Makefile b/drivers/firmware/Makefile index 3fdd3912709a..6b051293e689 100644 --- a/drivers/firmware/Makefile +++ b/drivers/firmware/Makefile @@ -1,6 +1,7 @@ # # Makefile for the linux kernel. # +obj-$(CONFIG_ARM_PSCI_FW) += psci.o obj-$(CONFIG_DMI) += dmi_scan.o obj-$(CONFIG_DMI_SYSFS) += dmi-sysfs.o obj-$(CONFIG_EDD) += edd.o diff --git a/drivers/firmware/psci.c b/drivers/firmware/psci.c new file mode 100644 index 000000000000..ae70d2485ca1 --- /dev/null +++ b/drivers/firmware/psci.c @@ -0,0 +1,470 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * Copyright (C) 2015 ARM Limited + */ + +#define pr_fmt(fmt) "psci: " fmt + +#include <linux/errno.h> +#include <linux/linkage.h> +#include <linux/of.h> +#include <linux/pm.h> +#include <linux/printk.h> +#include <linux/psci.h> +#include <linux/reboot.h> +#include <linux/suspend.h> + +#include <uapi/linux/psci.h> + +#include <asm/cputype.h> +#include <asm/system_misc.h> +#include <asm/smp_plat.h> +#include <asm/suspend.h> + +/* + * While a 64-bit OS can make calls with SMC32 calling conventions, for some + * calls it is necessary to use SMC64 to pass or return 64-bit values. + * For such calls PSCI_FN_NATIVE(version, name) will choose the appropriate + * (native-width) function ID. + */ +#ifdef CONFIG_64BIT +#define PSCI_FN_NATIVE(version, name) PSCI_##version##_FN64_##name +#else +#define PSCI_FN_NATIVE(version, name) PSCI_##version##_FN_##name +#endif + +/* + * The CPU any Trusted OS is resident on. The trusted OS may reject CPU_OFF + * calls to its resident CPU, so we must avoid issuing those. We never migrate + * a Trusted OS even if it claims to be capable of migration -- doing so will + * require cooperation with a Trusted OS driver. + */ +static int resident_cpu = -1; + +bool psci_tos_resident_on(int cpu) +{ + return cpu == resident_cpu; +} + +struct psci_operations psci_ops; + +typedef unsigned long (psci_fn)(unsigned long, unsigned long, + unsigned long, unsigned long); +asmlinkage psci_fn __invoke_psci_fn_hvc; +asmlinkage psci_fn __invoke_psci_fn_smc; +static psci_fn *invoke_psci_fn; + +enum psci_function { + PSCI_FN_CPU_SUSPEND, + PSCI_FN_CPU_ON, + PSCI_FN_CPU_OFF, + PSCI_FN_MIGRATE, + PSCI_FN_MAX, +}; + +static u32 psci_function_id[PSCI_FN_MAX]; + +#define PSCI_0_2_POWER_STATE_MASK \ + (PSCI_0_2_POWER_STATE_ID_MASK | \ + PSCI_0_2_POWER_STATE_TYPE_MASK | \ + PSCI_0_2_POWER_STATE_AFFL_MASK) + +#define PSCI_1_0_EXT_POWER_STATE_MASK \ + (PSCI_1_0_EXT_POWER_STATE_ID_MASK | \ + PSCI_1_0_EXT_POWER_STATE_TYPE_MASK) + +static u32 psci_cpu_suspend_feature; + +static inline bool psci_has_ext_power_state(void) +{ + return psci_cpu_suspend_feature & + PSCI_1_0_FEATURES_CPU_SUSPEND_PF_MASK; +} + +bool psci_power_state_loses_context(u32 state) +{ + const u32 mask = psci_has_ext_power_state() ? + PSCI_1_0_EXT_POWER_STATE_TYPE_MASK : + PSCI_0_2_POWER_STATE_TYPE_MASK; + + return state & mask; +} + +bool psci_power_state_is_valid(u32 state) +{ + const u32 valid_mask = psci_has_ext_power_state() ? + PSCI_1_0_EXT_POWER_STATE_MASK : + PSCI_0_2_POWER_STATE_MASK; + + return !(state & ~valid_mask); +} + +static int psci_to_linux_errno(int errno) +{ + switch (errno) { + case PSCI_RET_SUCCESS: + return 0; + case PSCI_RET_NOT_SUPPORTED: + return -EOPNOTSUPP; + case PSCI_RET_INVALID_PARAMS: + case PSCI_RET_INVALID_ADDRESS: + return -EINVAL; + case PSCI_RET_DENIED: + return -EPERM; + }; + + return -EINVAL; +} + +static u32 psci_get_version(void) +{ + return invoke_psci_fn(PSCI_0_2_FN_PSCI_VERSION, 0, 0, 0); +} + +static int psci_cpu_suspend(u32 state, unsigned long entry_point) +{ + int err; + u32 fn; + + fn = psci_function_id[PSCI_FN_CPU_SUSPEND]; + err = invoke_psci_fn(fn, state, entry_point, 0); + return psci_to_linux_errno(err); +} + +static int psci_cpu_off(u32 state) +{ + int err; + u32 fn; + + fn = psci_function_id[PSCI_FN_CPU_OFF]; + err = invoke_psci_fn(fn, state, 0, 0); + return psci_to_linux_errno(err); +} + +static int psci_cpu_on(unsigned long cpuid, unsigned long entry_point) +{ + int err; + u32 fn; + + fn = psci_function_id[PSCI_FN_CPU_ON]; + err = invoke_psci_fn(fn, cpuid, entry_point, 0); + return psci_to_linux_errno(err); +} + +static int psci_migrate(unsigned long cpuid) +{ + int err; + u32 fn; + + fn = psci_function_id[PSCI_FN_MIGRATE]; + err = invoke_psci_fn(fn, cpuid, 0, 0); + return psci_to_linux_errno(err); +} + +static int psci_affinity_info(unsigned long target_affinity, + unsigned long lowest_affinity_level) +{ + return invoke_psci_fn(PSCI_FN_NATIVE(0_2, AFFINITY_INFO), + target_affinity, lowest_affinity_level, 0); +} + +static int psci_migrate_info_type(void) +{ + return invoke_psci_fn(PSCI_0_2_FN_MIGRATE_INFO_TYPE, 0, 0, 0); +} + +static unsigned long psci_migrate_info_up_cpu(void) +{ + return invoke_psci_fn(PSCI_FN_NATIVE(0_2, MIGRATE_INFO_UP_CPU), + 0, 0, 0); +} + +static int get_set_conduit_method(struct device_node *np) +{ + const char *method; + + pr_info("probing for conduit method from DT.\n"); + + if (of_property_read_string(np, "method", &method)) { + pr_warn("missing \"method\" property\n"); + return -ENXIO; + } + + if (!strcmp("hvc", method)) { + invoke_psci_fn = __invoke_psci_fn_hvc; + } else if (!strcmp("smc", method)) { + invoke_psci_fn = __invoke_psci_fn_smc; + } else { + pr_warn("invalid \"method\" property: %s\n", method); + return -EINVAL; + } + return 0; +} + +static void psci_sys_reset(enum reboot_mode reboot_mode, const char *cmd) +{ + invoke_psci_fn(PSCI_0_2_FN_SYSTEM_RESET, 0, 0, 0); +} + +static void psci_sys_poweroff(void) +{ + invoke_psci_fn(PSCI_0_2_FN_SYSTEM_OFF, 0, 0, 0); +} + +static int __init psci_features(u32 psci_func_id) +{ + return invoke_psci_fn(PSCI_1_0_FN_PSCI_FEATURES, + psci_func_id, 0, 0); +} + +static int psci_system_suspend(unsigned long unused) +{ + return invoke_psci_fn(PSCI_FN_NATIVE(1_0, SYSTEM_SUSPEND), + virt_to_phys(cpu_resume), 0, 0); +} + +static int psci_system_suspend_enter(suspend_state_t state) +{ + return cpu_suspend(0, psci_system_suspend); +} + +static const struct platform_suspend_ops psci_suspend_ops = { + .valid = suspend_valid_only_mem, + .enter = psci_system_suspend_enter, +}; + +static void __init psci_init_system_suspend(void) +{ + int ret; + + if (!IS_ENABLED(CONFIG_SUSPEND)) + return; + + ret = psci_features(PSCI_FN_NATIVE(1_0, SYSTEM_SUSPEND)); + + if (ret != PSCI_RET_NOT_SUPPORTED) + suspend_set_ops(&psci_suspend_ops); +} + +static void __init psci_init_cpu_suspend(void) +{ + int feature = psci_features(psci_function_id[PSCI_FN_CPU_SUSPEND]); + + if (feature != PSCI_RET_NOT_SUPPORTED) + psci_cpu_suspend_feature = feature; +} + +/* + * Detect the presence of a resident Trusted OS which may cause CPU_OFF to + * return DENIED (which would be fatal). + */ +static void __init psci_init_migrate(void) +{ + unsigned long cpuid; + int type, cpu = -1; + + type = psci_ops.migrate_info_type(); + + if (type == PSCI_0_2_TOS_MP) { + pr_info("Trusted OS migration not required\n"); + return; + } + + if (type == PSCI_RET_NOT_SUPPORTED) { + pr_info("MIGRATE_INFO_TYPE not supported.\n"); + return; + } + + if (type != PSCI_0_2_TOS_UP_MIGRATE && + type != PSCI_0_2_TOS_UP_NO_MIGRATE) { + pr_err("MIGRATE_INFO_TYPE returned unknown type (%d)\n", type); + return; + } + + cpuid = psci_migrate_info_up_cpu(); + if (cpuid & ~MPIDR_HWID_BITMASK) { + pr_warn("MIGRATE_INFO_UP_CPU reported invalid physical ID (0x%lx)\n", + cpuid); + return; + } + + cpu = get_logical_index(cpuid); + resident_cpu = cpu >= 0 ? cpu : -1; + + pr_info("Trusted OS resident on physical CPU 0x%lx\n", cpuid); +} + +static void __init psci_0_2_set_functions(void) +{ + pr_info("Using standard PSCI v0.2 function IDs\n"); + psci_function_id[PSCI_FN_CPU_SUSPEND] = + PSCI_FN_NATIVE(0_2, CPU_SUSPEND); + psci_ops.cpu_suspend = psci_cpu_suspend; + + psci_function_id[PSCI_FN_CPU_OFF] = PSCI_0_2_FN_CPU_OFF; + psci_ops.cpu_off = psci_cpu_off; + + psci_function_id[PSCI_FN_CPU_ON] = PSCI_FN_NATIVE(0_2, CPU_ON); + psci_ops.cpu_on = psci_cpu_on; + + psci_function_id[PSCI_FN_MIGRATE] = PSCI_FN_NATIVE(0_2, MIGRATE); + psci_ops.migrate = psci_migrate; + + psci_ops.affinity_info = psci_affinity_info; + + psci_ops.migrate_info_type = psci_migrate_info_type; + + arm_pm_restart = psci_sys_reset; + + pm_power_off = psci_sys_poweroff; +} + +/* + * Probe function for PSCI firmware versions >= 0.2 + */ +static int __init psci_probe(void) +{ + u32 ver = psci_get_version(); + + pr_info("PSCIv%d.%d detected in firmware.\n", + PSCI_VERSION_MAJOR(ver), + PSCI_VERSION_MINOR(ver)); + + if (PSCI_VERSION_MAJOR(ver) == 0 && PSCI_VERSION_MINOR(ver) < 2) { + pr_err("Conflicting PSCI version detected.\n"); + return -EINVAL; + } + + psci_0_2_set_functions(); + + psci_init_migrate(); + + if (PSCI_VERSION_MAJOR(ver) >= 1) { + psci_init_cpu_suspend(); + psci_init_system_suspend(); + } + + return 0; +} + +typedef int (*psci_initcall_t)(const struct device_node *); + +/* + * PSCI init function for PSCI versions >=0.2 + * + * Probe based on PSCI PSCI_VERSION function + */ +static int __init psci_0_2_init(struct device_node *np) +{ + int err; + + err = get_set_conduit_method(np); + + if (err) + goto out_put_node; + /* + * Starting with v0.2, the PSCI specification introduced a call + * (PSCI_VERSION) that allows probing the firmware version, so + * that PSCI function IDs and version specific initialization + * can be carried out according to the specific version reported + * by firmware + */ + err = psci_probe(); + +out_put_node: + of_node_put(np); + return err; +} + +/* + * PSCI < v0.2 get PSCI Function IDs via DT. + */ +static int __init psci_0_1_init(struct device_node *np) +{ + u32 id; + int err; + + err = get_set_conduit_method(np); + + if (err) + goto out_put_node; + + pr_info("Using PSCI v0.1 Function IDs from DT\n"); + + if (!of_property_read_u32(np, "cpu_suspend", &id)) { + psci_function_id[PSCI_FN_CPU_SUSPEND] = id; + psci_ops.cpu_suspend = psci_cpu_suspend; + } + + if (!of_property_read_u32(np, "cpu_off", &id)) { + psci_function_id[PSCI_FN_CPU_OFF] = id; + psci_ops.cpu_off = psci_cpu_off; + } + + if (!of_property_read_u32(np, "cpu_on", &id)) { + psci_function_id[PSCI_FN_CPU_ON] = id; + psci_ops.cpu_on = psci_cpu_on; + } + + if (!of_property_read_u32(np, "migrate", &id)) { + psci_function_id[PSCI_FN_MIGRATE] = id; + psci_ops.migrate = psci_migrate; + } + +out_put_node: + of_node_put(np); + return err; +} + +static const struct of_device_id psci_of_match[] __initconst = { + { .compatible = "arm,psci", .data = psci_0_1_init}, + { .compatible = "arm,psci-0.2", .data = psci_0_2_init}, + { .compatible = "arm,psci-1.0", .data = psci_0_2_init}, + {}, +}; + +int __init psci_dt_init(void) +{ + struct device_node *np; + const struct of_device_id *matched_np; + psci_initcall_t init_fn; + + np = of_find_matching_node_and_match(NULL, psci_of_match, &matched_np); + + if (!np) + return -ENODEV; + + init_fn = (psci_initcall_t)matched_np->data; + return init_fn(np); +} + +#ifdef CONFIG_ACPI +/* + * We use PSCI 0.2+ when ACPI is deployed on ARM64 and it's + * explicitly clarified in SBBR + */ +int __init psci_acpi_init(void) +{ + if (!acpi_psci_present()) { + pr_info("is not implemented in ACPI.\n"); + return -EOPNOTSUPP; + } + + pr_info("probing for conduit method from ACPI.\n"); + + if (acpi_psci_use_hvc()) + invoke_psci_fn = __invoke_psci_fn_hvc; + else + invoke_psci_fn = __invoke_psci_fn_smc; + + return psci_probe(); +} +#endif diff --git a/drivers/irqchip/Makefile b/drivers/irqchip/Makefile index dda4927e47a6..801bd75f7266 100644 --- a/drivers/irqchip/Makefile +++ b/drivers/irqchip/Makefile @@ -22,7 +22,7 @@ obj-$(CONFIG_ARCH_SPEAR3XX) += spear-shirq.o obj-$(CONFIG_ARM_GIC) += irq-gic.o irq-gic-common.o obj-$(CONFIG_ARM_GIC_V2M) += irq-gic-v2m.o obj-$(CONFIG_ARM_GIC_V3) += irq-gic-v3.o irq-gic-common.o -obj-$(CONFIG_ARM_GIC_V3_ITS) += irq-gic-v3-its.o +obj-$(CONFIG_ARM_GIC_V3_ITS) += irq-gic-v3-its.o irq-gic-v3-its-pci-msi.o obj-$(CONFIG_ARM_NVIC) += irq-nvic.o obj-$(CONFIG_ARM_VIC) += irq-vic.o obj-$(CONFIG_ATMEL_AIC_IRQ) += irq-atmel-aic-common.o irq-atmel-aic.o diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c index fdf706555d72..ec9c37674a0b 100644 --- a/drivers/irqchip/irq-gic-v2m.c +++ b/drivers/irqchip/irq-gic-v2m.c @@ -45,7 +45,6 @@ struct v2m_data { spinlock_t msi_cnt_lock; - struct msi_controller mchip; struct resource res; /* GICv2m resource */ void __iomem *base; /* GICv2m virt address */ u32 spi_start; /* The SPI number that MSIs start */ @@ -218,6 +217,7 @@ static int __init gicv2m_init_one(struct device_node *node, { int ret; struct v2m_data *v2m; + struct irq_domain *inner_domain; v2m = kzalloc(sizeof(struct v2m_data), GFP_KERNEL); if (!v2m) { @@ -261,19 +261,18 @@ static int __init gicv2m_init_one(struct device_node *node, goto err_iounmap; } - v2m->domain = irq_domain_add_tree(NULL, &gicv2m_domain_ops, v2m); - if (!v2m->domain) { + inner_domain = irq_domain_add_tree(node, &gicv2m_domain_ops, v2m); + if (!inner_domain) { pr_err("Failed to create GICv2m domain\n"); ret = -ENOMEM; goto err_free_bm; } - v2m->domain->parent = parent; - v2m->mchip.of_node = node; - v2m->mchip.domain = pci_msi_create_irq_domain(node, - &gicv2m_msi_domain_info, - v2m->domain); - if (!v2m->mchip.domain) { + inner_domain->bus_token = DOMAIN_BUS_NEXUS; + inner_domain->parent = parent; + v2m->domain = pci_msi_create_irq_domain(node, &gicv2m_msi_domain_info, + inner_domain); + if (!v2m->domain) { pr_err("Failed to create MSI domain\n"); ret = -ENOMEM; goto err_free_domains; @@ -281,12 +280,6 @@ static int __init gicv2m_init_one(struct device_node *node, spin_lock_init(&v2m->msi_cnt_lock); - ret = of_pci_msi_chip_add(&v2m->mchip); - if (ret) { - pr_err("Failed to add msi_chip.\n"); - goto err_free_domains; - } - pr_info("Node %s: range[%#lx:%#lx], SPI[%d:%d]\n", node->name, (unsigned long)v2m->res.start, (unsigned long)v2m->res.end, v2m->spi_start, (v2m->spi_start + v2m->nr_spis)); @@ -294,10 +287,10 @@ static int __init gicv2m_init_one(struct device_node *node, return 0; err_free_domains: - if (v2m->mchip.domain) - irq_domain_remove(v2m->mchip.domain); if (v2m->domain) irq_domain_remove(v2m->domain); + if (inner_domain) + irq_domain_remove(inner_domain); err_free_bm: kfree(v2m->bm); err_iounmap: diff --git a/drivers/irqchip/irq-gic-v3-its-pci-msi.c b/drivers/irqchip/irq-gic-v3-its-pci-msi.c new file mode 100644 index 000000000000..cf351c637464 --- /dev/null +++ b/drivers/irqchip/irq-gic-v3-its-pci-msi.c @@ -0,0 +1,140 @@ +/* + * Copyright (C) 2013-2015 ARM Limited, All Rights Reserved. + * Author: Marc Zyngier <marc.zyngier@arm.com> + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program. If not, see <http://www.gnu.org/licenses/>. + */ + +#include <linux/msi.h> +#include <linux/of.h> +#include <linux/of_irq.h> +#include <linux/of_pci.h> + +static void its_mask_msi_irq(struct irq_data *d) +{ + pci_msi_mask_irq(d); + irq_chip_mask_parent(d); +} + +static void its_unmask_msi_irq(struct irq_data *d) +{ + pci_msi_unmask_irq(d); + irq_chip_unmask_parent(d); +} + +static struct irq_chip its_msi_irq_chip = { + .name = "ITS-MSI", + .irq_unmask = its_unmask_msi_irq, + .irq_mask = its_mask_msi_irq, + .irq_eoi = irq_chip_eoi_parent, + .irq_write_msi_msg = pci_msi_domain_write_msg, +}; + +struct its_pci_alias { + struct pci_dev *pdev; + u32 dev_id; + u32 count; +}; + +static int its_pci_msi_vec_count(struct pci_dev *pdev) +{ + int msi, msix; + + msi = max(pci_msi_vec_count(pdev), 0); + msix = max(pci_msix_vec_count(pdev), 0); + + return max(msi, msix); +} + +static int its_get_pci_alias(struct pci_dev *pdev, u16 alias, void *data) +{ + struct its_pci_alias *dev_alias = data; + + dev_alias->dev_id = alias; + if (pdev != dev_alias->pdev) + dev_alias->count += its_pci_msi_vec_count(dev_alias->pdev); + + return 0; +} + +static int its_pci_msi_prepare(struct irq_domain *domain, struct device *dev, + int nvec, msi_alloc_info_t *info) +{ + struct pci_dev *pdev; + struct its_pci_alias dev_alias; + struct msi_domain_info *msi_info; + + if (!dev_is_pci(dev)) + return -EINVAL; + + msi_info = msi_get_domain_info(domain->parent); + + pdev = to_pci_dev(dev); + dev_alias.pdev = pdev; + dev_alias.count = nvec; + + pci_for_each_dma_alias(pdev, its_get_pci_alias, &dev_alias); + + /* ITS specific DeviceID, as the core ITS ignores dev. */ + info->scratchpad[0].ul = dev_alias.dev_id; + + return msi_info->ops->msi_prepare(domain->parent, + dev, dev_alias.count, info); +} + +static struct msi_domain_ops its_pci_msi_ops = { + .msi_prepare = its_pci_msi_prepare, +}; + +static struct msi_domain_info its_pci_msi_domain_info = { + .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS | + MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX), + .ops = &its_pci_msi_ops, + .chip = &its_msi_irq_chip, +}; + +static struct of_device_id its_device_id[] = { + { .compatible = "arm,gic-v3-its", }, + {}, +}; + +static int __init its_pci_msi_init(void) +{ + struct device_node *np; + struct irq_domain *parent; + + for (np = of_find_matching_node(NULL, its_device_id); np; + np = of_find_matching_node(np, its_device_id)) { + if (!of_property_read_bool(np, "msi-controller")) + continue; + + parent = irq_find_matching_host(np, DOMAIN_BUS_NEXUS); + if (!parent || !msi_get_domain_info(parent)) { + pr_err("%s: unable to locate ITS domain\n", + np->full_name); + continue; + } + + if (!pci_msi_create_irq_domain(np, &its_pci_msi_domain_info, + parent)) { + pr_err("%s: unable to create PCI domain\n", + np->full_name); + continue; + } + + pr_info("PCI/MSI: %s domain created\n", np->full_name); + } + + return 0; +} +early_initcall(its_pci_msi_init); diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c index 9a791dd52199..78db25293384 100644 --- a/drivers/irqchip/irq-gic-v3-its.c +++ b/drivers/irqchip/irq-gic-v3-its.c @@ -54,14 +54,12 @@ struct its_collection { /* * The ITS structure - contains most of the infrastructure, with the - * msi_controller, the command queue, the collections, and the list of - * devices writing to it. + * top-level MSI domain, the command queue, the collections, and the + * list of devices writing to it. */ struct its_node { raw_spinlock_t lock; struct list_head entry; - struct msi_controller msi_chip; - struct irq_domain *domain; void __iomem *base; unsigned long phys_base; struct its_cmd_block *cmd_base; @@ -643,26 +641,6 @@ static struct irq_chip its_irq_chip = { .irq_compose_msi_msg = its_irq_compose_msi_msg, }; -static void its_mask_msi_irq(struct irq_data *d) -{ - pci_msi_mask_irq(d); - irq_chip_mask_parent(d); -} - -static void its_unmask_msi_irq(struct irq_data *d) -{ - pci_msi_unmask_irq(d); - irq_chip_unmask_parent(d); -} - -static struct irq_chip its_msi_irq_chip = { - .name = "ITS-MSI", - .irq_unmask = its_unmask_msi_irq, - .irq_mask = its_mask_msi_irq, - .irq_eoi = irq_chip_eoi_parent, - .irq_write_msi_msg = pci_msi_domain_write_msg, -}; - /* * How we allocate LPIs: * @@ -742,6 +720,9 @@ static unsigned long *its_lpi_alloc_chunks(int nr_irqs, int *base, int *nr_ids) out: spin_unlock(&lpi_lock); + if (!bitmap) + *base = *nr_ids = 0; + return bitmap; } @@ -831,7 +812,7 @@ static void its_free_tables(struct its_node *its) } } -static int its_alloc_tables(struct its_node *its) +static int its_alloc_tables(const char *node_name, struct its_node *its) { int err; int i; @@ -874,7 +855,7 @@ static int its_alloc_tables(struct its_node *its) if (order >= MAX_ORDER) { order = MAX_ORDER - 1; pr_warn("%s: Device Table too large, reduce its page order to %u\n", - its->msi_chip.of_node->full_name, order); + node_name, order); } } @@ -946,7 +927,7 @@ retry_baser: if (val != tmp) { pr_err("ITS: %s: GITS_BASER%d doesn't stick: %lx %lx\n", - its->msi_chip.of_node->full_name, i, + node_name, i, (unsigned long) val, (unsigned long) tmp); err = -ENXIO; goto out_free; @@ -1213,85 +1194,50 @@ static int its_alloc_device_irq(struct its_device *dev, irq_hw_number_t *hwirq) return 0; } -struct its_pci_alias { - struct pci_dev *pdev; - u32 dev_id; - u32 count; -}; - -static int its_pci_msi_vec_count(struct pci_dev *pdev) -{ - int msi, msix; - - msi = max(pci_msi_vec_count(pdev), 0); - msix = max(pci_msix_vec_count(pdev), 0); - - return max(msi, msix); -} - -static int its_get_pci_alias(struct pci_dev *pdev, u16 alias, void *data) -{ - struct its_pci_alias *dev_alias = data; - - dev_alias->dev_id = alias; - if (pdev != dev_alias->pdev) - dev_alias->count += its_pci_msi_vec_count(dev_alias->pdev); - - return 0; -} - static int its_msi_prepare(struct irq_domain *domain, struct device *dev, int nvec, msi_alloc_info_t *info) { - struct pci_dev *pdev; struct its_node *its; struct its_device *its_dev; - struct its_pci_alias dev_alias; - - if (!dev_is_pci(dev)) - return -EINVAL; + struct msi_domain_info *msi_info; + u32 dev_id; - pdev = to_pci_dev(dev); - dev_alias.pdev = pdev; - dev_alias.count = nvec; + /* + * We ignore "dev" entierely, and rely on the dev_id that has + * been passed via the scratchpad. This limits this domain's + * usefulness to upper layers that definitely know that they + * are built on top of the ITS. + */ + dev_id = info->scratchpad[0].ul; - pci_for_each_dma_alias(pdev, its_get_pci_alias, &dev_alias); - its = domain->parent->host_data; + msi_info = msi_get_domain_info(domain); + its = msi_info->data; - its_dev = its_find_device(its, dev_alias.dev_id); + its_dev = its_find_device(its, dev_id); if (its_dev) { /* * We already have seen this ID, probably through * another alias (PCI bridge of some sort). No need to * create the device. */ - dev_dbg(dev, "Reusing ITT for devID %x\n", dev_alias.dev_id); + pr_debug("Reusing ITT for devID %x\n", dev_id); goto out; } - its_dev = its_create_device(its, dev_alias.dev_id, dev_alias.count); + its_dev = its_create_device(its, dev_id, nvec); if (!its_dev) return -ENOMEM; - dev_dbg(&pdev->dev, "ITT %d entries, %d bits\n", - dev_alias.count, ilog2(dev_alias.count)); + pr_debug("ITT %d entries, %d bits\n", nvec, ilog2(nvec)); out: info->scratchpad[0].ptr = its_dev; - info->scratchpad[1].ptr = dev; return 0; } -static struct msi_domain_ops its_pci_msi_ops = { +static struct msi_domain_ops its_msi_domain_ops = { .msi_prepare = its_msi_prepare, }; -static struct msi_domain_info its_pci_msi_domain_info = { - .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS | - MSI_FLAG_MULTI_PCI_MSI | MSI_FLAG_PCI_MSIX), - .ops = &its_pci_msi_ops, - .chip = &its_msi_irq_chip, -}; - static int its_irq_gic_domain_alloc(struct irq_domain *domain, unsigned int virq, irq_hw_number_t hwirq) @@ -1327,9 +1273,9 @@ static int its_irq_domain_alloc(struct irq_domain *domain, unsigned int virq, irq_domain_set_hwirq_and_chip(domain, virq + i, hwirq, &its_irq_chip, its_dev); - dev_dbg(info->scratchpad[1].ptr, "ID:%d pID:%d vID:%d\n", - (int)(hwirq - its_dev->event_map.lpi_base), - (int)hwirq, virq + i); + pr_debug("ID:%d pID:%d vID:%d\n", + (int)(hwirq - its_dev->event_map.lpi_base), + (int) hwirq, virq + i); } return 0; @@ -1430,6 +1376,7 @@ static int its_probe(struct device_node *node, struct irq_domain *parent) struct resource res; struct its_node *its; void __iomem *its_base; + struct irq_domain *inner_domain; u32 val; u64 baser, tmp; int err; @@ -1473,7 +1420,6 @@ static int its_probe(struct device_node *node, struct irq_domain *parent) INIT_LIST_HEAD(&its->its_device_list); its->base = its_base; its->phys_base = res.start; - its->msi_chip.of_node = node; its->ite_size = ((readl_relaxed(its_base + GITS_TYPER) >> 4) & 0xf) + 1; its->cmd_base = kzalloc(ITS_CMD_QUEUE_SZ, GFP_KERNEL); @@ -1483,7 +1429,7 @@ static int its_probe(struct device_node *node, struct irq_domain *parent) } its->cmd_write = its->cmd_base; - err = its_alloc_tables(its); + err = its_alloc_tables(node->full_name, its); if (err) goto out_free_cmd; @@ -1519,26 +1465,27 @@ static int its_probe(struct device_node *node, struct irq_domain *parent) writeq_relaxed(0, its->base + GITS_CWRITER); writel_relaxed(GITS_CTLR_ENABLE, its->base + GITS_CTLR); - if (of_property_read_bool(its->msi_chip.of_node, "msi-controller")) { - its->domain = irq_domain_add_tree(NULL, &its_domain_ops, its); - if (!its->domain) { + if (of_property_read_bool(node, "msi-controller")) { + struct msi_domain_info *info; + + info = kzalloc(sizeof(*info), GFP_KERNEL); + if (!info) { err = -ENOMEM; goto out_free_tables; } - its->domain->parent = parent; - - its->msi_chip.domain = pci_msi_create_irq_domain(node, - &its_pci_msi_domain_info, - its->domain); - if (!its->msi_chip.domain) { + inner_domain = irq_domain_add_tree(node, &its_domain_ops, its); + if (!inner_domain) { err = -ENOMEM; - goto out_free_domains; + kfree(info); + goto out_free_tables; } - err = of_pci_msi_chip_add(&its->msi_chip); - if (err) - goto out_free_domains; + inner_domain->parent = parent; + inner_domain->bus_token = DOMAIN_BUS_NEXUS; + info->ops = &its_msi_domain_ops; + info->data = its; + inner_domain->host_data = info; } spin_lock(&its_lock); @@ -1547,11 +1494,6 @@ static int its_probe(struct device_node *node, struct irq_domain *parent) return 0; -out_free_domains: - if (its->msi_chip.domain) - irq_domain_remove(its->msi_chip.domain); - if (its->domain) - irq_domain_remove(its->domain); out_free_tables: its_free_tables(its); out_free_cmd: diff --git a/drivers/md/bcache/request.c b/drivers/md/bcache/request.c index ab43faddb447..9c083b9d02d3 100644 --- a/drivers/md/bcache/request.c +++ b/drivers/md/bcache/request.c @@ -15,6 +15,7 @@ #include <linux/module.h> #include <linux/hash.h> #include <linux/random.h> +#include <linux/backing-dev.h> #include <trace/events/bcache.h> diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 62610aafaac7..12d72660ebc9 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2159,7 +2159,7 @@ static int dm_any_congested(void *congested_data, int bdi_bits) * the query about congestion status of request_queue */ if (dm_request_based(md)) - r = md->queue->backing_dev_info.state & + r = md->queue->backing_dev_info.wb.state & bdi_bits; else r = dm_table_any_congested(map, bdi_bits); diff --git a/drivers/md/dm.h b/drivers/md/dm.h index 6123c2bf9150..4e984993d40a 100644 --- a/drivers/md/dm.h +++ b/drivers/md/dm.h @@ -14,6 +14,7 @@ #include <linux/device-mapper.h> #include <linux/list.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/hdreg.h> #include <linux/completion.h> #include <linux/kobject.h> diff --git a/drivers/md/md.h b/drivers/md/md.h index 4046a6c6f223..7da6e9c3cb53 100644 --- a/drivers/md/md.h +++ b/drivers/md/md.h @@ -16,6 +16,7 @@ #define _MD_MD_H #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/kobject.h> #include <linux/list.h> #include <linux/mm.h> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c index bff6c1c7fecb..1342cd2ccb24 100644 --- a/drivers/md/raid1.c +++ b/drivers/md/raid1.c @@ -745,7 +745,7 @@ static int raid1_congested(struct mddev *mddev, int bits) struct r1conf *conf = mddev->private; int i, ret = 0; - if ((bits & (1 << BDI_async_congested)) && + if ((bits & (1 << WB_async_congested)) && conf->pending_count >= max_queued_requests) return 1; @@ -760,7 +760,7 @@ static int raid1_congested(struct mddev *mddev, int bits) /* Note the '|| 1' - when read_balance prefers * non-congested targets, it can be removed */ - if ((bits & (1<<BDI_async_congested)) || 1) + if ((bits & (1 << WB_async_congested)) || 1) ret |= bdi_congested(&q->backing_dev_info, bits); else ret &= bdi_congested(&q->backing_dev_info, bits); diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c index adfc83a0f023..646074d33ceb 100644 --- a/drivers/md/raid10.c +++ b/drivers/md/raid10.c @@ -914,7 +914,7 @@ static int raid10_congested(struct mddev *mddev, int bits) struct r10conf *conf = mddev->private; int i, ret = 0; - if ((bits & (1 << BDI_async_congested)) && + if ((bits & (1 << WB_async_congested)) && conf->pending_count >= max_queued_requests) return 1; diff --git a/drivers/mtd/devices/block2mtd.c b/drivers/mtd/devices/block2mtd.c index b16f3cda97ff..e2c0057737e6 100644 --- a/drivers/mtd/devices/block2mtd.c +++ b/drivers/mtd/devices/block2mtd.c @@ -20,6 +20,7 @@ #include <linux/delay.h> #include <linux/fs.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/bio.h> #include <linux/pagemap.h> #include <linux/list.h> diff --git a/drivers/of/irq.c b/drivers/of/irq.c index 1a7980692f25..c21e364dcd96 100644 --- a/drivers/of/irq.c +++ b/drivers/of/irq.c @@ -18,6 +18,7 @@ * driver. */ +#include <linux/device.h> #include <linux/errno.h> #include <linux/list.h> #include <linux/module.h> @@ -577,3 +578,23 @@ err: kfree(desc); } } + +/** + * of_msi_configure - Set the msi_domain field of a device + * @dev: device structure to associate with an MSI irq domain + * @np: device node for that device + */ +void of_msi_configure(struct device *dev, struct device_node *np) +{ + struct device_node *msi_np; + struct irq_domain *d; + + msi_np = of_parse_phandle(np, "msi-parent", 0); + if (!msi_np) + return; + + d = irq_find_matching_host(msi_np, DOMAIN_BUS_PLATFORM_MSI); + if (!d) + d = irq_find_host(msi_np); + dev_set_msi_domain(dev, d); +} diff --git a/drivers/of/platform.c b/drivers/of/platform.c index ddf8e42c9367..8a002d6151f2 100644 --- a/drivers/of/platform.c +++ b/drivers/of/platform.c @@ -184,6 +184,7 @@ static struct platform_device *of_platform_device_create_pdata( dev->dev.bus = &platform_bus_type; dev->dev.platform_data = platform_data; of_dma_configure(&dev->dev, dev->dev.of_node); + of_msi_configure(&dev->dev, dev->dev.of_node); if (of_device_add(dev) != 0) { of_dma_deconfigure(&dev->dev); diff --git a/drivers/pci/Makefile b/drivers/pci/Makefile index 73e4af400a5a..be3f631c3f75 100644 --- a/drivers/pci/Makefile +++ b/drivers/pci/Makefile @@ -33,6 +33,7 @@ obj-$(CONFIG_PCI_IOV) += iov.o # obj-$(CONFIG_ALPHA) += setup-irq.o obj-$(CONFIG_ARM) += setup-irq.o +obj-$(CONFIG_ARM64) += setup-irq.o obj-$(CONFIG_UNICORE32) += setup-irq.o obj-$(CONFIG_SUPERH) += setup-irq.o obj-$(CONFIG_MIPS) += setup-irq.o diff --git a/drivers/pci/host/Kconfig b/drivers/pci/host/Kconfig index 1dfb567b3522..e90d132d3316 100644 --- a/drivers/pci/host/Kconfig +++ b/drivers/pci/host/Kconfig @@ -53,7 +53,7 @@ config PCI_RCAR_GEN2_PCIE config PCI_HOST_GENERIC bool "Generic PCI host controller" - depends on ARM && OF + depends on (ARM || ARM64) && OF help Say Y here if you want to support a simple generic PCI host controller, such as the one emulated by kvmtool. @@ -89,11 +89,20 @@ config PCI_XGENE depends on ARCH_XGENE depends on OF select PCIEPORTBUS + select PCI_MSI_IRQ_DOMAIN if PCI_MSI help Say Y here if you want internal PCI support on APM X-Gene SoC. There are 5 internal PCIe ports available. Each port is GEN3 capable and have varied lanes from x1 to x8. +config PCI_XGENE_MSI + bool "X-Gene v1 PCIe MSI feature" + depends on PCI_XGENE && PCI_MSI + default y + help + Say Y here if you want PCIe MSI support for the APM X-Gene v1 SoC. + This MSI driver supports 5 PCIe ports on the APM X-Gene v1 SoC. + config PCI_LAYERSCAPE bool "Freescale Layerscape PCIe controller" depends on OF && ARM diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile index f733b4e27642..1957431d3fcc 100644 --- a/drivers/pci/host/Makefile +++ b/drivers/pci/host/Makefile @@ -11,6 +11,7 @@ obj-$(CONFIG_PCIE_SPEAR13XX) += pcie-spear13xx.o obj-$(CONFIG_PCI_KEYSTONE) += pci-keystone-dw.o pci-keystone.o obj-$(CONFIG_PCIE_XILINX) += pcie-xilinx.o obj-$(CONFIG_PCI_XGENE) += pci-xgene.o +obj-$(CONFIG_PCI_XGENE_MSI) += pci-xgene-msi.o obj-$(CONFIG_PCI_LAYERSCAPE) += pci-layerscape.o obj-$(CONFIG_PCI_VERSATILE) += pci-versatile.o obj-$(CONFIG_PCIE_IPROC) += pcie-iproc.o diff --git a/drivers/pci/host/pci-host-generic.c b/drivers/pci/host/pci-host-generic.c index ba46e581db99..265dd25169bf 100644 --- a/drivers/pci/host/pci-host-generic.c +++ b/drivers/pci/host/pci-host-generic.c @@ -38,7 +38,16 @@ struct gen_pci_cfg_windows { const struct gen_pci_cfg_bus_ops *ops; }; +/* + * ARM pcibios functions expect the ARM struct pci_sys_data as the PCI + * sysdata. Add pci_sys_data as the first element in struct gen_pci so + * that when we use a gen_pci pointer as sysdata, it is also a pointer to + * a struct pci_sys_data. + */ struct gen_pci { +#ifdef CONFIG_ARM + struct pci_sys_data sys; +#endif struct pci_host_bridge host; struct gen_pci_cfg_windows cfg; struct list_head resources; @@ -48,8 +57,7 @@ static void __iomem *gen_pci_map_cfg_bus_cam(struct pci_bus *bus, unsigned int devfn, int where) { - struct pci_sys_data *sys = bus->sysdata; - struct gen_pci *pci = sys->private_data; + struct gen_pci *pci = bus->sysdata; resource_size_t idx = bus->number - pci->cfg.bus_range->start; return pci->cfg.win[idx] + ((devfn << 8) | where); @@ -64,8 +72,7 @@ static void __iomem *gen_pci_map_cfg_bus_ecam(struct pci_bus *bus, unsigned int devfn, int where) { - struct pci_sys_data *sys = bus->sysdata; - struct gen_pci *pci = sys->private_data; + struct gen_pci *pci = bus->sysdata; resource_size_t idx = bus->number - pci->cfg.bus_range->start; return pci->cfg.win[idx] + ((devfn << 12) | where); @@ -198,13 +205,6 @@ static int gen_pci_parse_map_cfg_windows(struct gen_pci *pci) return 0; } -static int gen_pci_setup(int nr, struct pci_sys_data *sys) -{ - struct gen_pci *pci = sys->private_data; - list_splice_init(&pci->resources, &sys->resources); - return 1; -} - static int gen_pci_probe(struct platform_device *pdev) { int err; @@ -214,13 +214,7 @@ static int gen_pci_probe(struct platform_device *pdev) struct device *dev = &pdev->dev; struct device_node *np = dev->of_node; struct gen_pci *pci = devm_kzalloc(dev, sizeof(*pci), GFP_KERNEL); - struct hw_pci hw = { - .nr_controllers = 1, - .private_data = (void **)&pci, - .setup = gen_pci_setup, - .map_irq = of_irq_parse_and_map_pci, - .ops = &gen_pci_ops, - }; + struct pci_bus *bus, *child; if (!pci) return -ENOMEM; @@ -258,7 +252,27 @@ static int gen_pci_probe(struct platform_device *pdev) return err; } - pci_common_init_dev(dev, &hw); + /* Do not reassign resources if probe only */ + if (!pci_has_flag(PCI_PROBE_ONLY)) + pci_add_flags(PCI_REASSIGN_ALL_RSRC | PCI_REASSIGN_ALL_BUS); + + bus = pci_scan_root_bus(dev, 0, &gen_pci_ops, pci, &pci->resources); + if (!bus) { + dev_err(dev, "Scanning rootbus failed"); + return -ENODEV; + } + + pci_fixup_irqs(pci_common_swizzle, of_irq_parse_and_map_pci); + + if (!pci_has_flag(PCI_PROBE_ONLY)) { + pci_bus_size_bridges(bus); + pci_bus_assign_resources(bus); + + list_for_each_entry(child, &bus->children, node) + pcie_bus_configure_settings(child); + } + + pci_bus_add_devices(bus); return 0; } diff --git a/drivers/pci/host/pci-xgene-msi.c b/drivers/pci/host/pci-xgene-msi.c new file mode 100644 index 000000000000..ec5a14b3189b --- /dev/null +++ b/drivers/pci/host/pci-xgene-msi.c @@ -0,0 +1,587 @@ +/* + * APM X-Gene MSI Driver + * + * Copyright (c) 2014, Applied Micro Circuits Corporation + * Author: Tanmay Inamdar <tinamdar@apm.com> + * Duc Dang <dhdang@apm.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms of the GNU General Public License as published by the + * Free Software Foundation; either version 2 of the License, or (at your + * option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ +#include <linux/cpu.h> +#include <linux/interrupt.h> +#include <linux/module.h> +#include <linux/msi.h> +#include <linux/of_irq.h> +#include <linux/irqchip/chained_irq.h> +#include <linux/pci.h> +#include <linux/platform_device.h> +#include <linux/of_pci.h> + +#define MSI_IR0 0x000000 +#define MSI_INT0 0x800000 +#define IDX_PER_GROUP 8 +#define IRQS_PER_IDX 16 +#define NR_HW_IRQS 16 +#define NR_MSI_VEC (IDX_PER_GROUP * IRQS_PER_IDX * NR_HW_IRQS) + +struct xgene_msi_group { + struct xgene_msi *msi; + int gic_irq; + u32 msi_grp; +}; + +struct xgene_msi { + struct device_node *node; + struct irq_domain *inner_domain; + struct irq_domain *msi_domain; + u64 msi_addr; + void __iomem *msi_regs; + unsigned long *bitmap; + struct mutex bitmap_lock; + struct xgene_msi_group *msi_groups; + int num_cpus; +}; + +/* Global data */ +static struct xgene_msi xgene_msi_ctrl; + +static struct irq_chip xgene_msi_top_irq_chip = { + .name = "X-Gene1 MSI", + .irq_enable = pci_msi_unmask_irq, + .irq_disable = pci_msi_mask_irq, + .irq_mask = pci_msi_mask_irq, + .irq_unmask = pci_msi_unmask_irq, +}; + +static struct msi_domain_info xgene_msi_domain_info = { + .flags = (MSI_FLAG_USE_DEF_DOM_OPS | MSI_FLAG_USE_DEF_CHIP_OPS | + MSI_FLAG_PCI_MSIX), + .chip = &xgene_msi_top_irq_chip, +}; + +/* + * X-Gene v1 has 16 groups of MSI termination registers MSInIRx, where + * n is group number (0..F), x is index of registers in each group (0..7) + * The register layout is as follows: + * MSI0IR0 base_addr + * MSI0IR1 base_addr + 0x10000 + * ... ... + * MSI0IR6 base_addr + 0x60000 + * MSI0IR7 base_addr + 0x70000 + * MSI1IR0 base_addr + 0x80000 + * MSI1IR1 base_addr + 0x90000 + * ... ... + * MSI1IR7 base_addr + 0xF0000 + * MSI2IR0 base_addr + 0x100000 + * ... ... + * MSIFIR0 base_addr + 0x780000 + * MSIFIR1 base_addr + 0x790000 + * ... ... + * MSIFIR7 base_addr + 0x7F0000 + * MSIINT0 base_addr + 0x800000 + * MSIINT1 base_addr + 0x810000 + * ... ... + * MSIINTF base_addr + 0x8F0000 + * + * Each index register supports 16 MSI vectors (0..15) to generate interrupt. + * There are total 16 GIC IRQs assigned for these 16 groups of MSI termination + * registers. + * + * Each MSI termination group has 1 MSIINTn register (n is 0..15) to indicate + * the MSI pending status caused by 1 of its 8 index registers. + */ + +/* MSInIRx read helper */ +static u32 xgene_msi_ir_read(struct xgene_msi *msi, + u32 msi_grp, u32 msir_idx) +{ + return readl_relaxed(msi->msi_regs + MSI_IR0 + + (msi_grp << 19) + (msir_idx << 16)); +} + +/* MSIINTn read helper */ +static u32 xgene_msi_int_read(struct xgene_msi *msi, u32 msi_grp) +{ + return readl_relaxed(msi->msi_regs + MSI_INT0 + (msi_grp << 16)); +} + +/* + * With 2048 MSI vectors supported, the MSI message can be constructed using + * following scheme: + * - Divide into 8 256-vector groups + * Group 0: 0-255 + * Group 1: 256-511 + * Group 2: 512-767 + * ... + * Group 7: 1792-2047 + * - Each 256-vector group is divided into 16 16-vector groups + * As an example: 16 16-vector groups for 256-vector group 0-255 is + * Group 0: 0-15 + * Group 1: 16-32 + * ... + * Group 15: 240-255 + * - The termination address of MSI vector in 256-vector group n and 16-vector + * group x is the address of MSIxIRn + * - The data for MSI vector in 16-vector group x is x + */ +static u32 hwirq_to_reg_set(unsigned long hwirq) +{ + return (hwirq / (NR_HW_IRQS * IRQS_PER_IDX)); +} + +static u32 hwirq_to_group(unsigned long hwirq) +{ + return (hwirq % NR_HW_IRQS); +} + +static u32 hwirq_to_msi_data(unsigned long hwirq) +{ + return ((hwirq / NR_HW_IRQS) % IRQS_PER_IDX); +} + +static void xgene_compose_msi_msg(struct irq_data *data, struct msi_msg *msg) +{ + struct xgene_msi *msi = irq_data_get_irq_chip_data(data); + u32 reg_set = hwirq_to_reg_set(data->hwirq); + u32 group = hwirq_to_group(data->hwirq); + u64 target_addr = msi->msi_addr + (((8 * group) + reg_set) << 16); + + msg->address_hi = upper_32_bits(target_addr); + msg->address_lo = lower_32_bits(target_addr); + msg->data = hwirq_to_msi_data(data->hwirq); +} + +/* + * X-Gene v1 only has 16 MSI GIC IRQs for 2048 MSI vectors. To maintain + * the expected behaviour of .set_affinity for each MSI interrupt, the 16 + * MSI GIC IRQs are statically allocated to 8 X-Gene v1 cores (2 GIC IRQs + * for each core). The MSI vector is moved fom 1 MSI GIC IRQ to another + * MSI GIC IRQ to steer its MSI interrupt to correct X-Gene v1 core. As a + * consequence, the total MSI vectors that X-Gene v1 supports will be + * reduced to 256 (2048/8) vectors. + */ +static int hwirq_to_cpu(unsigned long hwirq) +{ + return (hwirq % xgene_msi_ctrl.num_cpus); +} + +static unsigned long hwirq_to_canonical_hwirq(unsigned long hwirq) +{ + return (hwirq - hwirq_to_cpu(hwirq)); +} + +static int xgene_msi_set_affinity(struct irq_data *irqdata, + const struct cpumask *mask, bool force) +{ + int target_cpu = cpumask_first(mask); + int curr_cpu; + + curr_cpu = hwirq_to_cpu(irqdata->hwirq); + if (curr_cpu == target_cpu) + return IRQ_SET_MASK_OK_DONE; + + /* Update MSI number to target the new CPU */ + irqdata->hwirq = hwirq_to_canonical_hwirq(irqdata->hwirq) + target_cpu; + + return IRQ_SET_MASK_OK; +} + +static struct irq_chip xgene_msi_bottom_irq_chip = { + .name = "MSI", + .irq_set_affinity = xgene_msi_set_affinity, + .irq_compose_msi_msg = xgene_compose_msi_msg, +}; + +static int xgene_irq_domain_alloc(struct irq_domain *domain, unsigned int virq, + unsigned int nr_irqs, void *args) +{ + struct xgene_msi *msi = domain->host_data; + int msi_irq; + + mutex_lock(&msi->bitmap_lock); + + msi_irq = bitmap_find_next_zero_area(msi->bitmap, NR_MSI_VEC, 0, + msi->num_cpus, 0); + if (msi_irq < NR_MSI_VEC) + bitmap_set(msi->bitmap, msi_irq, msi->num_cpus); + else + msi_irq = -ENOSPC; + + mutex_unlock(&msi->bitmap_lock); + + if (msi_irq < 0) + return msi_irq; + + irq_domain_set_info(domain, virq, msi_irq, + &xgene_msi_bottom_irq_chip, domain->host_data, + handle_simple_irq, NULL, NULL); + set_irq_flags(virq, IRQF_VALID); + + return 0; +} + +static void xgene_irq_domain_free(struct irq_domain *domain, + unsigned int virq, unsigned int nr_irqs) +{ + struct irq_data *d = irq_domain_get_irq_data(domain, virq); + struct xgene_msi *msi = irq_data_get_irq_chip_data(d); + u32 hwirq; + + mutex_lock(&msi->bitmap_lock); + + hwirq = hwirq_to_canonical_hwirq(d->hwirq); + bitmap_clear(msi->bitmap, hwirq, msi->num_cpus); + + mutex_unlock(&msi->bitmap_lock); + + irq_domain_free_irqs_parent(domain, virq, nr_irqs); +} + +static const struct irq_domain_ops msi_domain_ops = { + .alloc = xgene_irq_domain_alloc, + .free = xgene_irq_domain_free, +}; + +static int xgene_allocate_domains(struct xgene_msi *msi) +{ + msi->inner_domain = irq_domain_add_linear(NULL, NR_MSI_VEC, + &msi_domain_ops, msi); + if (!msi->inner_domain) + return -ENOMEM; + + msi->msi_domain = pci_msi_create_irq_domain(msi->node, + &xgene_msi_domain_info, + msi->inner_domain); + + if (!msi->msi_domain) { + irq_domain_remove(msi->inner_domain); + return -ENOMEM; + } + + return 0; +} + +static void xgene_free_domains(struct xgene_msi *msi) +{ + if (msi->msi_domain) + irq_domain_remove(msi->msi_domain); + if (msi->inner_domain) + irq_domain_remove(msi->inner_domain); +} + +static int xgene_msi_init_allocator(struct xgene_msi *xgene_msi) +{ + int size = BITS_TO_LONGS(NR_MSI_VEC) * sizeof(long); + + xgene_msi->bitmap = kzalloc(size, GFP_KERNEL); + if (!xgene_msi->bitmap) + return -ENOMEM; + + mutex_init(&xgene_msi->bitmap_lock); + + xgene_msi->msi_groups = kcalloc(NR_HW_IRQS, + sizeof(struct xgene_msi_group), + GFP_KERNEL); + if (!xgene_msi->msi_groups) + return -ENOMEM; + + return 0; +} + +static void xgene_msi_isr(unsigned int irq, struct irq_desc *desc) +{ + struct irq_chip *chip = irq_desc_get_chip(desc); + struct xgene_msi_group *msi_groups; + struct xgene_msi *xgene_msi; + unsigned int virq; + int msir_index, msir_val, hw_irq; + u32 intr_index, grp_select, msi_grp; + + chained_irq_enter(chip, desc); + + msi_groups = irq_desc_get_handler_data(desc); + xgene_msi = msi_groups->msi; + msi_grp = msi_groups->msi_grp; + + /* + * MSIINTn (n is 0..F) indicates if there is a pending MSI interrupt + * If bit x of this register is set (x is 0..7), one or more interupts + * corresponding to MSInIRx is set. + */ + grp_select = xgene_msi_int_read(xgene_msi, msi_grp); + while (grp_select) { + msir_index = ffs(grp_select) - 1; + /* + * Calculate MSInIRx address to read to check for interrupts + * (refer to termination address and data assignment + * described in xgene_compose_msi_msg() ) + */ + msir_val = xgene_msi_ir_read(xgene_msi, msi_grp, msir_index); + while (msir_val) { + intr_index = ffs(msir_val) - 1; + /* + * Calculate MSI vector number (refer to the termination + * address and data assignment described in + * xgene_compose_msi_msg function) + */ + hw_irq = (((msir_index * IRQS_PER_IDX) + intr_index) * + NR_HW_IRQS) + msi_grp; + /* + * As we have multiple hw_irq that maps to single MSI, + * always look up the virq using the hw_irq as seen from + * CPU0 + */ + hw_irq = hwirq_to_canonical_hwirq(hw_irq); + virq = irq_find_mapping(xgene_msi->inner_domain, hw_irq); + WARN_ON(!virq); + if (virq != 0) + generic_handle_irq(virq); + msir_val &= ~(1 << intr_index); + } + grp_select &= ~(1 << msir_index); + + if (!grp_select) { + /* + * We handled all interrupts happened in this group, + * resample this group MSI_INTx register in case + * something else has been made pending in the meantime + */ + grp_select = xgene_msi_int_read(xgene_msi, msi_grp); + } + } + + chained_irq_exit(chip, desc); +} + +static int xgene_msi_remove(struct platform_device *pdev) +{ + int virq, i; + struct xgene_msi *msi = platform_get_drvdata(pdev); + + for (i = 0; i < NR_HW_IRQS; i++) { + virq = msi->msi_groups[i].gic_irq; + if (virq != 0) { + irq_set_chained_handler(virq, NULL); + irq_set_handler_data(virq, NULL); + } + } + kfree(msi->msi_groups); + + kfree(msi->bitmap); + msi->bitmap = NULL; + + xgene_free_domains(msi); + + return 0; +} + +static int xgene_msi_hwirq_alloc(unsigned int cpu) +{ + struct xgene_msi *msi = &xgene_msi_ctrl; + struct xgene_msi_group *msi_group; + cpumask_var_t mask; + int i; + int err; + + for (i = cpu; i < NR_HW_IRQS; i += msi->num_cpus) { + msi_group = &msi->msi_groups[i]; + if (!msi_group->gic_irq) + continue; + + irq_set_chained_handler(msi_group->gic_irq, + xgene_msi_isr); + err = irq_set_handler_data(msi_group->gic_irq, msi_group); + if (err) { + pr_err("failed to register GIC IRQ handler\n"); + return -EINVAL; + } + /* + * Statically allocate MSI GIC IRQs to each CPU core. + * With 8-core X-Gene v1, 2 MSI GIC IRQs are allocated + * to each core. + */ + if (alloc_cpumask_var(&mask, GFP_KERNEL)) { + cpumask_clear(mask); + cpumask_set_cpu(cpu, mask); + err = irq_set_affinity(msi_group->gic_irq, mask); + if (err) + pr_err("failed to set affinity for GIC IRQ"); + free_cpumask_var(mask); + } else { + pr_err("failed to alloc CPU mask for affinity\n"); + err = -EINVAL; + } + + if (err) { + irq_set_chained_handler(msi_group->gic_irq, NULL); + irq_set_handler_data(msi_group->gic_irq, NULL); + return err; + } + } + + return 0; +} + +static void xgene_msi_hwirq_free(unsigned int cpu) +{ + struct xgene_msi *msi = &xgene_msi_ctrl; + struct xgene_msi_group *msi_group; + int i; + + for (i = cpu; i < NR_HW_IRQS; i += msi->num_cpus) { + msi_group = &msi->msi_groups[i]; + if (!msi_group->gic_irq) + continue; + + irq_set_chained_handler(msi_group->gic_irq, NULL); + irq_set_handler_data(msi_group->gic_irq, NULL); + } +} + +static int xgene_msi_cpu_callback(struct notifier_block *nfb, + unsigned long action, void *hcpu) +{ + unsigned cpu = (unsigned long)hcpu; + + switch (action) { + case CPU_ONLINE: + case CPU_ONLINE_FROZEN: + xgene_msi_hwirq_alloc(cpu); + break; + case CPU_DEAD: + case CPU_DEAD_FROZEN: + xgene_msi_hwirq_free(cpu); + break; + default: + break; + } + + return NOTIFY_OK; +} + +static struct notifier_block xgene_msi_cpu_notifier = { + .notifier_call = xgene_msi_cpu_callback, +}; + +static const struct of_device_id xgene_msi_match_table[] = { + {.compatible = "apm,xgene1-msi"}, + {}, +}; + +static int xgene_msi_probe(struct platform_device *pdev) +{ + struct resource *res; + int rc, irq_index; + struct xgene_msi *xgene_msi; + unsigned int cpu; + int virt_msir; + u32 msi_val, msi_idx; + + xgene_msi = &xgene_msi_ctrl; + + platform_set_drvdata(pdev, xgene_msi); + + res = platform_get_resource(pdev, IORESOURCE_MEM, 0); + xgene_msi->msi_regs = devm_ioremap_resource(&pdev->dev, res); + if (IS_ERR(xgene_msi->msi_regs)) { + dev_err(&pdev->dev, "no reg space\n"); + rc = -EINVAL; + goto error; + } + xgene_msi->msi_addr = res->start; + xgene_msi->node = pdev->dev.of_node; + xgene_msi->num_cpus = num_possible_cpus(); + + rc = xgene_msi_init_allocator(xgene_msi); + if (rc) { + dev_err(&pdev->dev, "Error allocating MSI bitmap\n"); + goto error; + } + + rc = xgene_allocate_domains(xgene_msi); + if (rc) { + dev_err(&pdev->dev, "Failed to allocate MSI domain\n"); + goto error; + } + + for (irq_index = 0; irq_index < NR_HW_IRQS; irq_index++) { + virt_msir = platform_get_irq(pdev, irq_index); + if (virt_msir < 0) { + dev_err(&pdev->dev, "Cannot translate IRQ index %d\n", + irq_index); + rc = -EINVAL; + goto error; + } + xgene_msi->msi_groups[irq_index].gic_irq = virt_msir; + xgene_msi->msi_groups[irq_index].msi_grp = irq_index; + xgene_msi->msi_groups[irq_index].msi = xgene_msi; + } + + /* + * MSInIRx registers are read-to-clear; before registering + * interrupt handlers, read all of them to clear spurious + * interrupts that may occur before the driver is probed. + */ + for (irq_index = 0; irq_index < NR_HW_IRQS; irq_index++) { + for (msi_idx = 0; msi_idx < IDX_PER_GROUP; msi_idx++) + msi_val = xgene_msi_ir_read(xgene_msi, irq_index, + msi_idx); + /* Read MSIINTn to confirm */ + msi_val = xgene_msi_int_read(xgene_msi, irq_index); + if (msi_val) { + dev_err(&pdev->dev, "Failed to clear spurious IRQ\n"); + rc = -EINVAL; + goto error; + } + } + + cpu_notifier_register_begin(); + + for_each_online_cpu(cpu) + if (xgene_msi_hwirq_alloc(cpu)) { + dev_err(&pdev->dev, "failed to register MSI handlers\n"); + cpu_notifier_register_done(); + goto error; + } + + rc = __register_hotcpu_notifier(&xgene_msi_cpu_notifier); + if (rc) { + dev_err(&pdev->dev, "failed to add CPU MSI notifier\n"); + cpu_notifier_register_done(); + goto error; + } + + cpu_notifier_register_done(); + + dev_info(&pdev->dev, "APM X-Gene PCIe MSI driver loaded\n"); + + return 0; + +error: + xgene_msi_remove(pdev); + return rc; +} + +static struct platform_driver xgene_msi_driver = { + .driver = { + .name = "xgene-msi", + .owner = THIS_MODULE, + .of_match_table = xgene_msi_match_table, + }, + .probe = xgene_msi_probe, + .remove = xgene_msi_remove, +}; + +static int __init xgene_pcie_msi_init(void) +{ + return platform_driver_register(&xgene_msi_driver); +} +subsys_initcall(xgene_pcie_msi_init); diff --git a/drivers/pci/host/pci-xgene.c b/drivers/pci/host/pci-xgene.c index ee082c0366ec..3e5a636c9a9a 100644 --- a/drivers/pci/host/pci-xgene.c +++ b/drivers/pci/host/pci-xgene.c @@ -468,6 +468,23 @@ static int xgene_pcie_setup(struct xgene_pcie_port *port, return 0; } +static int xgene_pcie_msi_enable(struct pci_bus *bus) +{ + struct device_node *msi_node; + + msi_node = of_parse_phandle(bus->dev.of_node, + "msi-parent", 0); + if (!msi_node) + return -ENODEV; + + bus->msi = of_pci_find_msi_chip_by_node(msi_node); + if (!bus->msi) + return -ENODEV; + + bus->msi->dev = &bus->dev; + return 0; +} + static int xgene_pcie_probe_bridge(struct platform_device *pdev) { struct device_node *dn = pdev->dev.of_node; @@ -504,6 +521,10 @@ static int xgene_pcie_probe_bridge(struct platform_device *pdev) if (!bus) return -ENOMEM; + if (IS_ENABLED(CONFIG_PCI_MSI)) + if (xgene_pcie_msi_enable(bus)) + dev_info(port->dev, "failed to enable MSI\n"); + pci_scan_child_bus(bus); pci_assign_unassigned_bus_resources(bus); pci_bus_add_devices(bus); diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index c3e7dfcf9ff5..fc54038495da 100644 --- a/drivers/pci/msi.c +++ b/drivers/pci/msi.c @@ -39,14 +39,13 @@ struct irq_domain * __weak arch_get_pci_msi_domain(struct pci_dev *dev) static struct irq_domain *pci_msi_get_domain(struct pci_dev *dev) { - struct irq_domain *domain = NULL; + struct irq_domain *domain; - if (dev->bus->msi) - domain = dev->bus->msi->domain; - if (!domain) - domain = arch_get_pci_msi_domain(dev); + domain = dev_get_msi_domain(&dev->dev); + if (domain) + return domain; - return domain; + return arch_get_pci_msi_domain(dev); } static int pci_msi_setup_msi_irqs(struct pci_dev *dev, int nvec, int type) @@ -1302,12 +1301,19 @@ struct irq_domain *pci_msi_create_irq_domain(struct device_node *node, struct msi_domain_info *info, struct irq_domain *parent) { + struct irq_domain *domain; + if (info->flags & MSI_FLAG_USE_DEF_DOM_OPS) pci_msi_domain_update_dom_ops(info); if (info->flags & MSI_FLAG_USE_DEF_CHIP_OPS) pci_msi_domain_update_chip_ops(info); - return msi_create_irq_domain(node, info, parent); + domain = msi_create_irq_domain(node, info, parent); + if (!domain) + return NULL; + + domain->bus_token = DOMAIN_BUS_PCI_MSI; + return domain; } /** diff --git a/drivers/pci/of.c b/drivers/pci/of.c index f0929934bb7a..2e99a500cb83 100644 --- a/drivers/pci/of.c +++ b/drivers/pci/of.c @@ -9,6 +9,7 @@ * 2 of the License, or (at your option) any later version. */ +#include <linux/irqdomain.h> #include <linux/kernel.h> #include <linux/pci.h> #include <linux/of.h> @@ -59,3 +60,32 @@ struct device_node * __weak pcibios_get_phb_of_node(struct pci_bus *bus) return of_node_get(bus->bridge->parent->of_node); return NULL; } + +struct irq_domain *pci_host_bridge_of_msi_domain(struct pci_bus *bus) +{ +#ifdef CONFIG_IRQ_DOMAIN + struct device_node *np; + struct irq_domain *d; + + if (!bus->dev.of_node) + return NULL; + + /* Start looking for a phandle to an MSI controller. */ + np = of_parse_phandle(bus->dev.of_node, "msi-parent", 0); + + /* + * If we don't have an msi-parent property, look for a domain + * directly attached to the host bridge. + */ + if (!np) + np = bus->dev.of_node; + + d = irq_find_matching_host(np, DOMAIN_BUS_PCI_MSI); + if (d) + return d; + + return irq_find_host(np); +#else + return NULL; +#endif +} diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c index c91185721345..8a369ed5ea3f 100644 --- a/drivers/pci/probe.c +++ b/drivers/pci/probe.c @@ -661,6 +661,35 @@ static void pci_set_bus_speed(struct pci_bus *bus) } } +static struct irq_domain *pci_host_bridge_msi_domain(struct pci_bus *bus) +{ + struct irq_domain *d; + + /* + * Any firmware interface that can resolve the msi_domain + * should be called from here. + */ + d = pci_host_bridge_of_msi_domain(bus); + + return d; +} + +static void pci_set_bus_msi_domain(struct pci_bus *bus) +{ + struct irq_domain *d; + + /* + * Either bus is the root, and we must obtain it from the + * firmware, or we inherit it from the bridge device. + */ + if (pci_is_root_bus(bus)) + d = pci_host_bridge_msi_domain(bus); + else + d = dev_get_msi_domain(&bus->self->dev); + + dev_set_msi_domain(&bus->dev, d); +} + static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, struct pci_dev *bridge, int busnr) { @@ -714,6 +743,7 @@ static struct pci_bus *pci_alloc_child_bus(struct pci_bus *parent, bridge->subordinate = child; add_dev: + pci_set_bus_msi_domain(child); ret = device_register(&child->dev); WARN_ON(ret < 0); @@ -1508,6 +1538,17 @@ static void pci_init_capabilities(struct pci_dev *dev) pci_enable_acs(dev); } +static void pci_set_msi_domain(struct pci_dev *dev) +{ + /* + * If no domain has been set through the pcibios_add_device + * callback, inherit the default from the bus device. + */ + if (!dev_get_msi_domain(&dev->dev)) + dev_set_msi_domain(&dev->dev, + dev_get_msi_domain(&dev->bus->dev)); +} + void pci_device_add(struct pci_dev *dev, struct pci_bus *bus) { int ret; @@ -1549,6 +1590,9 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus) ret = pcibios_add_device(dev); WARN_ON(ret < 0); + /* Setup MSI irq domain */ + pci_set_msi_domain(dev); + /* Notifier could use PCI capabilities */ dev->match_driver = false; ret = device_add(&dev->dev); @@ -1939,6 +1983,7 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int bus, b->bridge = get_device(&bridge->dev); device_enable_async_suspend(b->bridge); pci_set_bus_of_node(b); + pci_set_bus_msi_domain(b); if (!parent) set_dev_node(b->bridge, pcibus_to_node(b)); diff --git a/drivers/platform/x86/acerhdf.c b/drivers/platform/x86/acerhdf.c index 594c918b553d..1ef02daddb60 100644 --- a/drivers/platform/x86/acerhdf.c +++ b/drivers/platform/x86/acerhdf.c @@ -372,7 +372,8 @@ static int acerhdf_bind(struct thermal_zone_device *thermal, return 0; if (thermal_zone_bind_cooling_device(thermal, 0, cdev, - THERMAL_NO_LIMIT, THERMAL_NO_LIMIT)) { + THERMAL_NO_LIMIT, THERMAL_NO_LIMIT, + THERMAL_WEIGHT_DEFAULT)) { pr_err("error binding cooling dev\n"); return -EINVAL; } diff --git a/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h b/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h index d72605864b0a..14562788e4e0 100644 --- a/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h +++ b/drivers/staging/lustre/lustre/include/linux/lustre_patchless_compat.h @@ -55,9 +55,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page) if (PagePrivate(page)) page->mapping->a_ops->invalidatepage(page, 0, PAGE_CACHE_SIZE); - if (TestClearPageDirty(page)) - account_page_cleaned(page, mapping); - + cancel_dirty_page(page); ClearPageMappedToDisk(page); ll_delete_from_page_cache(page); } diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig index af40db0df58e..2acc88378099 100644 --- a/drivers/thermal/Kconfig +++ b/drivers/thermal/Kconfig @@ -42,6 +42,17 @@ config THERMAL_OF Say 'Y' here if you need to build thermal infrastructure based on device tree. +config THERMAL_WRITABLE_TRIPS + bool "Enable writable trip points" + help + This option allows the system integrator to choose whether + trip temperatures can be changed from userspace. The + writable trips need to be specified when setting up the + thermal zone but the choice here takes precedence. + + Say 'Y' here if you would like to allow userspace tools to + change trip temperatures. + choice prompt "Default Thermal governor" default THERMAL_DEFAULT_GOV_STEP_WISE @@ -71,6 +82,14 @@ config THERMAL_DEFAULT_GOV_USER_SPACE Select this if you want to let the user space manage the platform thermals. +config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR + bool "power_allocator" + select THERMAL_GOV_POWER_ALLOCATOR + help + Select this if you want to control temperature based on + system and device power allocation. This governor can only + operate on cooling devices that implement the power API. + endchoice config THERMAL_GOV_FAIR_SHARE @@ -99,6 +118,13 @@ config THERMAL_GOV_USER_SPACE help Enable this to let the user space manage the platform thermals. +config THERMAL_GOV_POWER_ALLOCATOR + bool "Power allocator thermal governor" + select THERMAL_POWER_ACTOR + help + Enable this to manage platform thermals by dynamically + allocating and limiting power to devices. + config CPU_THERMAL bool "generic cpu cooling support" depends on CPU_FREQ diff --git a/drivers/thermal/Makefile b/drivers/thermal/Makefile index fa0dc486790f..cd769ab06cbb 100644 --- a/drivers/thermal/Makefile +++ b/drivers/thermal/Makefile @@ -14,6 +14,7 @@ thermal_sys-$(CONFIG_THERMAL_GOV_FAIR_SHARE) += fair_share.o thermal_sys-$(CONFIG_THERMAL_GOV_BANG_BANG) += gov_bang_bang.o thermal_sys-$(CONFIG_THERMAL_GOV_STEP_WISE) += step_wise.o thermal_sys-$(CONFIG_THERMAL_GOV_USER_SPACE) += user_space.o +thermal_sys-$(CONFIG_THERMAL_GOV_POWER_ALLOCATOR) += power_allocator.o # cpufreq cooling thermal_sys-$(CONFIG_CPU_THERMAL) += cpu_cooling.o diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c index f65f0d109fc8..6509c61b9648 100644 --- a/drivers/thermal/cpu_cooling.c +++ b/drivers/thermal/cpu_cooling.c @@ -26,10 +26,13 @@ #include <linux/thermal.h> #include <linux/cpufreq.h> #include <linux/err.h> +#include <linux/pm_opp.h> #include <linux/slab.h> #include <linux/cpu.h> #include <linux/cpu_cooling.h> +#include <trace/events/thermal.h> + /* * Cooling state <-> CPUFreq frequency * @@ -45,6 +48,19 @@ */ /** + * struct power_table - frequency to power conversion + * @frequency: frequency in KHz + * @power: power in mW + * + * This structure is built when the cooling device registers and helps + * in translating frequency to power and viceversa. + */ +struct power_table { + u32 frequency; + u32 power; +}; + +/** * struct cpufreq_cooling_device - data for cooling device with cpufreq * @id: unique integer value corresponding to each cpufreq_cooling_device * registered. @@ -58,6 +74,15 @@ * cpufreq frequencies. * @allowed_cpus: all the cpus involved for this cpufreq_cooling_device. * @node: list_head to link all cpufreq_cooling_device together. + * @last_load: load measured by the latest call to cpufreq_get_actual_power() + * @time_in_idle: previous reading of the absolute time that this cpu was idle + * @time_in_idle_timestamp: wall time of the last invocation of + * get_cpu_idle_time_us() + * @dyn_power_table: array of struct power_table for frequency to power + * conversion, sorted in ascending order. + * @dyn_power_table_entries: number of entries in the @dyn_power_table array + * @cpu_dev: the first cpu_device from @allowed_cpus that has OPPs registered + * @plat_get_static_power: callback to calculate the static power * * This structure is required for keeping information of each registered * cpufreq_cooling_device. @@ -71,6 +96,13 @@ struct cpufreq_cooling_device { unsigned int *freq_table; /* In descending order */ struct cpumask allowed_cpus; struct list_head node; + u32 last_load; + u64 *time_in_idle; + u64 *time_in_idle_timestamp; + struct power_table *dyn_power_table; + int dyn_power_table_entries; + struct device *cpu_dev; + get_static_t plat_get_static_power; }; static DEFINE_IDR(cpufreq_idr); static DEFINE_MUTEX(cooling_cpufreq_lock); @@ -186,23 +218,237 @@ static int cpufreq_thermal_notifier(struct notifier_block *nb, unsigned long max_freq = 0; struct cpufreq_cooling_device *cpufreq_dev; - if (event != CPUFREQ_ADJUST) - return 0; + switch (event) { - mutex_lock(&cooling_cpufreq_lock); - list_for_each_entry(cpufreq_dev, &cpufreq_dev_list, node) { - if (!cpumask_test_cpu(policy->cpu, - &cpufreq_dev->allowed_cpus)) + case CPUFREQ_ADJUST: + mutex_lock(&cooling_cpufreq_lock); + list_for_each_entry(cpufreq_dev, &cpufreq_dev_list, node) { + if (!cpumask_test_cpu(policy->cpu, + &cpufreq_dev->allowed_cpus)) + continue; + + max_freq = cpufreq_dev->cpufreq_val; + + if (policy->max != max_freq) + cpufreq_verify_within_limits(policy, 0, + max_freq); + } + mutex_unlock(&cooling_cpufreq_lock); + break; + default: + return NOTIFY_DONE; + } + + return NOTIFY_OK; +} + +/** + * build_dyn_power_table() - create a dynamic power to frequency table + * @cpufreq_device: the cpufreq cooling device in which to store the table + * @capacitance: dynamic power coefficient for these cpus + * + * Build a dynamic power to frequency table for this cpu and store it + * in @cpufreq_device. This table will be used in cpu_power_to_freq() and + * cpu_freq_to_power() to convert between power and frequency + * efficiently. Power is stored in mW, frequency in KHz. The + * resulting table is in ascending order. + * + * Return: 0 on success, -E* on error. + */ +static int build_dyn_power_table(struct cpufreq_cooling_device *cpufreq_device, + u32 capacitance) +{ + struct power_table *power_table; + struct dev_pm_opp *opp; + struct device *dev = NULL; + int num_opps = 0, cpu, i, ret = 0; + unsigned long freq; + + rcu_read_lock(); + + for_each_cpu(cpu, &cpufreq_device->allowed_cpus) { + dev = get_cpu_device(cpu); + if (!dev) { + dev_warn(&cpufreq_device->cool_dev->device, + "No cpu device for cpu %d\n", cpu); continue; + } + + num_opps = dev_pm_opp_get_opp_count(dev); + if (num_opps > 0) { + break; + } else if (num_opps < 0) { + ret = num_opps; + goto unlock; + } + } - max_freq = cpufreq_dev->cpufreq_val; + if (num_opps == 0) { + ret = -EINVAL; + goto unlock; + } - if (policy->max != max_freq) - cpufreq_verify_within_limits(policy, 0, max_freq); + power_table = kcalloc(num_opps, sizeof(*power_table), GFP_KERNEL); + if (!power_table) { + ret = -ENOMEM; + goto unlock; } - mutex_unlock(&cooling_cpufreq_lock); - return 0; + for (freq = 0, i = 0; + opp = dev_pm_opp_find_freq_ceil(dev, &freq), !IS_ERR(opp); + freq++, i++) { + u32 freq_mhz, voltage_mv; + u64 power; + + freq_mhz = freq / 1000000; + voltage_mv = dev_pm_opp_get_voltage(opp) / 1000; + + /* + * Do the multiplication with MHz and millivolt so as + * to not overflow. + */ + power = (u64)capacitance * freq_mhz * voltage_mv * voltage_mv; + do_div(power, 1000000000); + + /* frequency is stored in power_table in KHz */ + power_table[i].frequency = freq / 1000; + + /* power is stored in mW */ + power_table[i].power = power; + } + + if (i == 0) { + ret = PTR_ERR(opp); + goto unlock; + } + + cpufreq_device->cpu_dev = dev; + cpufreq_device->dyn_power_table = power_table; + cpufreq_device->dyn_power_table_entries = i; + +unlock: + rcu_read_unlock(); + return ret; +} + +static u32 cpu_freq_to_power(struct cpufreq_cooling_device *cpufreq_device, + u32 freq) +{ + int i; + struct power_table *pt = cpufreq_device->dyn_power_table; + + for (i = 1; i < cpufreq_device->dyn_power_table_entries; i++) + if (freq < pt[i].frequency) + break; + + return pt[i - 1].power; +} + +static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_device, + u32 power) +{ + int i; + struct power_table *pt = cpufreq_device->dyn_power_table; + + for (i = 1; i < cpufreq_device->dyn_power_table_entries; i++) + if (power < pt[i].power) + break; + + return pt[i - 1].frequency; +} + +/** + * get_load() - get load for a cpu since last updated + * @cpufreq_device: &struct cpufreq_cooling_device for this cpu + * @cpu: cpu number + * + * Return: The average load of cpu @cpu in percentage since this + * function was last called. + */ +static u32 get_load(struct cpufreq_cooling_device *cpufreq_device, int cpu) +{ + u32 load; + u64 now, now_idle, delta_time, delta_idle; + + now_idle = get_cpu_idle_time(cpu, &now, 0); + delta_idle = now_idle - cpufreq_device->time_in_idle[cpu]; + delta_time = now - cpufreq_device->time_in_idle_timestamp[cpu]; + + if (delta_time <= delta_idle) + load = 0; + else + load = div64_u64(100 * (delta_time - delta_idle), delta_time); + + cpufreq_device->time_in_idle[cpu] = now_idle; + cpufreq_device->time_in_idle_timestamp[cpu] = now; + + return load; +} + +/** + * get_static_power() - calculate the static power consumed by the cpus + * @cpufreq_device: struct &cpufreq_cooling_device for this cpu cdev + * @tz: thermal zone device in which we're operating + * @freq: frequency in KHz + * @power: pointer in which to store the calculated static power + * + * Calculate the static power consumed by the cpus described by + * @cpu_actor running at frequency @freq. This function relies on a + * platform specific function that should have been provided when the + * actor was registered. If it wasn't, the static power is assumed to + * be negligible. The calculated static power is stored in @power. + * + * Return: 0 on success, -E* on failure. + */ +static int get_static_power(struct cpufreq_cooling_device *cpufreq_device, + struct thermal_zone_device *tz, unsigned long freq, + u32 *power) +{ + struct dev_pm_opp *opp; + unsigned long voltage; + struct cpumask *cpumask = &cpufreq_device->allowed_cpus; + unsigned long freq_hz = freq * 1000; + + if (!cpufreq_device->plat_get_static_power || + !cpufreq_device->cpu_dev) { + *power = 0; + return 0; + } + + rcu_read_lock(); + + opp = dev_pm_opp_find_freq_exact(cpufreq_device->cpu_dev, freq_hz, + true); + voltage = dev_pm_opp_get_voltage(opp); + + rcu_read_unlock(); + + if (voltage == 0) { + dev_warn_ratelimited(cpufreq_device->cpu_dev, + "Failed to get voltage for frequency %lu: %ld\n", + freq_hz, IS_ERR(opp) ? PTR_ERR(opp) : 0); + return -EINVAL; + } + + return cpufreq_device->plat_get_static_power(cpumask, tz->passive_delay, + voltage, power); +} + +/** + * get_dynamic_power() - calculate the dynamic power + * @cpufreq_device: &cpufreq_cooling_device for this cdev + * @freq: current frequency + * + * Return: the dynamic power consumed by the cpus described by + * @cpufreq_device. + */ +static u32 get_dynamic_power(struct cpufreq_cooling_device *cpufreq_device, + unsigned long freq) +{ + u32 raw_cpu_power; + + raw_cpu_power = cpu_freq_to_power(cpufreq_device, freq); + return (raw_cpu_power * cpufreq_device->last_load) / 100; } /* cpufreq cooling device callback functions are defined below */ @@ -280,8 +526,205 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev, return 0; } +/** + * cpufreq_get_requested_power() - get the current power + * @cdev: &thermal_cooling_device pointer + * @tz: a valid thermal zone device pointer + * @power: pointer in which to store the resulting power + * + * Calculate the current power consumption of the cpus in milliwatts + * and store it in @power. This function should actually calculate + * the requested power, but it's hard to get the frequency that + * cpufreq would have assigned if there were no thermal limits. + * Instead, we calculate the current power on the assumption that the + * immediate future will look like the immediate past. + * + * We use the current frequency and the average load since this + * function was last called. In reality, there could have been + * multiple opps since this function was last called and that affects + * the load calculation. While it's not perfectly accurate, this + * simplification is good enough and works. REVISIT this, as more + * complex code may be needed if experiments show that it's not + * accurate enough. + * + * Return: 0 on success, -E* if getting the static power failed. + */ +static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev, + struct thermal_zone_device *tz, + u32 *power) +{ + unsigned long freq; + int i = 0, cpu, ret; + u32 static_power, dynamic_power, total_load = 0; + struct cpufreq_cooling_device *cpufreq_device = cdev->devdata; + u32 *load_cpu = NULL; + + cpu = cpumask_any_and(&cpufreq_device->allowed_cpus, cpu_online_mask); + + /* + * All the CPUs are offline, thus the requested power by + * the cdev is 0 + */ + if (cpu >= nr_cpu_ids) { + *power = 0; + return 0; + } + + freq = cpufreq_quick_get(cpu); + + if (trace_thermal_power_cpu_get_power_enabled()) { + u32 ncpus = cpumask_weight(&cpufreq_device->allowed_cpus); + + load_cpu = devm_kcalloc(&cdev->device, ncpus, sizeof(*load_cpu), + GFP_KERNEL); + } + + for_each_cpu(cpu, &cpufreq_device->allowed_cpus) { + u32 load; + + if (cpu_online(cpu)) + load = get_load(cpufreq_device, cpu); + else + load = 0; + + total_load += load; + if (trace_thermal_power_cpu_limit_enabled() && load_cpu) + load_cpu[i] = load; + + i++; + } + + cpufreq_device->last_load = total_load; + + dynamic_power = get_dynamic_power(cpufreq_device, freq); + ret = get_static_power(cpufreq_device, tz, freq, &static_power); + if (ret) { + if (load_cpu) + devm_kfree(&cdev->device, load_cpu); + return ret; + } + + if (load_cpu) { + trace_thermal_power_cpu_get_power( + &cpufreq_device->allowed_cpus, + freq, load_cpu, i, dynamic_power, static_power); + + devm_kfree(&cdev->device, load_cpu); + } + + *power = static_power + dynamic_power; + return 0; +} + +/** + * cpufreq_state2power() - convert a cpu cdev state to power consumed + * @cdev: &thermal_cooling_device pointer + * @tz: a valid thermal zone device pointer + * @state: cooling device state to be converted + * @power: pointer in which to store the resulting power + * + * Convert cooling device state @state into power consumption in + * milliwatts assuming 100% load. Store the calculated power in + * @power. + * + * Return: 0 on success, -EINVAL if the cooling device state could not + * be converted into a frequency or other -E* if there was an error + * when calculating the static power. + */ +static int cpufreq_state2power(struct thermal_cooling_device *cdev, + struct thermal_zone_device *tz, + unsigned long state, u32 *power) +{ + unsigned int freq, num_cpus; + cpumask_t cpumask; + u32 static_power, dynamic_power; + int ret; + struct cpufreq_cooling_device *cpufreq_device = cdev->devdata; + + cpumask_and(&cpumask, &cpufreq_device->allowed_cpus, cpu_online_mask); + num_cpus = cpumask_weight(&cpumask); + + /* None of our cpus are online, so no power */ + if (num_cpus == 0) { + *power = 0; + return 0; + } + + freq = cpufreq_device->freq_table[state]; + if (!freq) + return -EINVAL; + + dynamic_power = cpu_freq_to_power(cpufreq_device, freq) * num_cpus; + ret = get_static_power(cpufreq_device, tz, freq, &static_power); + if (ret) + return ret; + + *power = static_power + dynamic_power; + return 0; +} + +/** + * cpufreq_power2state() - convert power to a cooling device state + * @cdev: &thermal_cooling_device pointer + * @tz: a valid thermal zone device pointer + * @power: power in milliwatts to be converted + * @state: pointer in which to store the resulting state + * + * Calculate a cooling device state for the cpus described by @cdev + * that would allow them to consume at most @power mW and store it in + * @state. Note that this calculation depends on external factors + * such as the cpu load or the current static power. Calling this + * function with the same power as input can yield different cooling + * device states depending on those external factors. + * + * Return: 0 on success, -ENODEV if no cpus are online or -EINVAL if + * the calculated frequency could not be converted to a valid state. + * The latter should not happen unless the frequencies available to + * cpufreq have changed since the initialization of the cpu cooling + * device. + */ +static int cpufreq_power2state(struct thermal_cooling_device *cdev, + struct thermal_zone_device *tz, u32 power, + unsigned long *state) +{ + unsigned int cpu, cur_freq, target_freq; + int ret; + s32 dyn_power; + u32 last_load, normalised_power, static_power; + struct cpufreq_cooling_device *cpufreq_device = cdev->devdata; + + cpu = cpumask_any_and(&cpufreq_device->allowed_cpus, cpu_online_mask); + + /* None of our cpus are online */ + if (cpu >= nr_cpu_ids) + return -ENODEV; + + cur_freq = cpufreq_quick_get(cpu); + ret = get_static_power(cpufreq_device, tz, cur_freq, &static_power); + if (ret) + return ret; + + dyn_power = power - static_power; + dyn_power = dyn_power > 0 ? dyn_power : 0; + last_load = cpufreq_device->last_load ?: 1; + normalised_power = (dyn_power * 100) / last_load; + target_freq = cpu_power_to_freq(cpufreq_device, normalised_power); + + *state = cpufreq_cooling_get_level(cpu, target_freq); + if (*state == THERMAL_CSTATE_INVALID) { + dev_warn_ratelimited(&cdev->device, + "Failed to convert %dKHz for cpu %d into a cdev state\n", + target_freq, cpu); + return -EINVAL; + } + + trace_thermal_power_cpu_limit(&cpufreq_device->allowed_cpus, + target_freq, *state, power); + return 0; +} + /* Bind cpufreq callbacks to thermal cooling device ops */ -static struct thermal_cooling_device_ops const cpufreq_cooling_ops = { +static struct thermal_cooling_device_ops cpufreq_cooling_ops = { .get_max_state = cpufreq_get_max_state, .get_cur_state = cpufreq_get_cur_state, .set_cur_state = cpufreq_set_cur_state, @@ -311,6 +754,9 @@ static unsigned int find_next_max(struct cpufreq_frequency_table *table, * @np: a valid struct device_node to the cooling device device tree node * @clip_cpus: cpumask of cpus where the frequency constraints will happen. * Normally this should be same as cpufreq policy->related_cpus. + * @capacitance: dynamic power coefficient for these cpus + * @plat_static_func: function to calculate the static power consumed by these + * cpus (optional) * * This interface function registers the cpufreq cooling device with the name * "thermal-cpufreq-%x". This api can support multiple instances of cpufreq @@ -322,13 +768,14 @@ static unsigned int find_next_max(struct cpufreq_frequency_table *table, */ static struct thermal_cooling_device * __cpufreq_cooling_register(struct device_node *np, - const struct cpumask *clip_cpus) + const struct cpumask *clip_cpus, u32 capacitance, + get_static_t plat_static_func) { struct thermal_cooling_device *cool_dev; struct cpufreq_cooling_device *cpufreq_dev; char dev_name[THERMAL_NAME_LENGTH]; struct cpufreq_frequency_table *pos, *table; - unsigned int freq, i; + unsigned int freq, i, num_cpus; int ret; table = cpufreq_frequency_get_table(cpumask_first(clip_cpus)); @@ -341,6 +788,23 @@ __cpufreq_cooling_register(struct device_node *np, if (!cpufreq_dev) return ERR_PTR(-ENOMEM); + num_cpus = cpumask_weight(clip_cpus); + cpufreq_dev->time_in_idle = kcalloc(num_cpus, + sizeof(*cpufreq_dev->time_in_idle), + GFP_KERNEL); + if (!cpufreq_dev->time_in_idle) { + cool_dev = ERR_PTR(-ENOMEM); + goto free_cdev; + } + + cpufreq_dev->time_in_idle_timestamp = + kcalloc(num_cpus, sizeof(*cpufreq_dev->time_in_idle_timestamp), + GFP_KERNEL); + if (!cpufreq_dev->time_in_idle_timestamp) { + cool_dev = ERR_PTR(-ENOMEM); + goto free_time_in_idle; + } + /* Find max levels */ cpufreq_for_each_valid_entry(pos, table) cpufreq_dev->max_level++; @@ -349,7 +813,7 @@ __cpufreq_cooling_register(struct device_node *np, cpufreq_dev->max_level, GFP_KERNEL); if (!cpufreq_dev->freq_table) { cool_dev = ERR_PTR(-ENOMEM); - goto free_cdev; + goto free_time_in_idle_timestamp; } /* max_level is an index, not a counter */ @@ -357,6 +821,20 @@ __cpufreq_cooling_register(struct device_node *np, cpumask_copy(&cpufreq_dev->allowed_cpus, clip_cpus); + if (capacitance) { + cpufreq_cooling_ops.get_requested_power = + cpufreq_get_requested_power; + cpufreq_cooling_ops.state2power = cpufreq_state2power; + cpufreq_cooling_ops.power2state = cpufreq_power2state; + cpufreq_dev->plat_get_static_power = plat_static_func; + + ret = build_dyn_power_table(cpufreq_dev, capacitance); + if (ret) { + cool_dev = ERR_PTR(ret); + goto free_table; + } + } + ret = get_idr(&cpufreq_idr, &cpufreq_dev->id); if (ret) { cool_dev = ERR_PTR(ret); @@ -402,6 +880,10 @@ remove_idr: release_idr(&cpufreq_idr, cpufreq_dev->id); free_table: kfree(cpufreq_dev->freq_table); +free_time_in_idle_timestamp: + kfree(cpufreq_dev->time_in_idle_timestamp); +free_time_in_idle: + kfree(cpufreq_dev->time_in_idle); free_cdev: kfree(cpufreq_dev); @@ -422,7 +904,7 @@ free_cdev: struct thermal_cooling_device * cpufreq_cooling_register(const struct cpumask *clip_cpus) { - return __cpufreq_cooling_register(NULL, clip_cpus); + return __cpufreq_cooling_register(NULL, clip_cpus, 0, NULL); } EXPORT_SYMBOL_GPL(cpufreq_cooling_register); @@ -446,11 +928,78 @@ of_cpufreq_cooling_register(struct device_node *np, if (!np) return ERR_PTR(-EINVAL); - return __cpufreq_cooling_register(np, clip_cpus); + return __cpufreq_cooling_register(np, clip_cpus, 0, NULL); } EXPORT_SYMBOL_GPL(of_cpufreq_cooling_register); /** + * cpufreq_power_cooling_register() - create cpufreq cooling device with power extensions + * @clip_cpus: cpumask of cpus where the frequency constraints will happen + * @capacitance: dynamic power coefficient for these cpus + * @plat_static_func: function to calculate the static power consumed by these + * cpus (optional) + * + * This interface function registers the cpufreq cooling device with + * the name "thermal-cpufreq-%x". This api can support multiple + * instances of cpufreq cooling devices. Using this function, the + * cooling device will implement the power extensions by using a + * simple cpu power model. The cpus must have registered their OPPs + * using the OPP library. + * + * An optional @plat_static_func may be provided to calculate the + * static power consumed by these cpus. If the platform's static + * power consumption is unknown or negligible, make it NULL. + * + * Return: a valid struct thermal_cooling_device pointer on success, + * on failure, it returns a corresponding ERR_PTR(). + */ +struct thermal_cooling_device * +cpufreq_power_cooling_register(const struct cpumask *clip_cpus, u32 capacitance, + get_static_t plat_static_func) +{ + return __cpufreq_cooling_register(NULL, clip_cpus, capacitance, + plat_static_func); +} +EXPORT_SYMBOL(cpufreq_power_cooling_register); + +/** + * of_cpufreq_power_cooling_register() - create cpufreq cooling device with power extensions + * @np: a valid struct device_node to the cooling device device tree node + * @clip_cpus: cpumask of cpus where the frequency constraints will happen + * @capacitance: dynamic power coefficient for these cpus + * @plat_static_func: function to calculate the static power consumed by these + * cpus (optional) + * + * This interface function registers the cpufreq cooling device with + * the name "thermal-cpufreq-%x". This api can support multiple + * instances of cpufreq cooling devices. Using this API, the cpufreq + * cooling device will be linked to the device tree node provided. + * Using this function, the cooling device will implement the power + * extensions by using a simple cpu power model. The cpus must have + * registered their OPPs using the OPP library. + * + * An optional @plat_static_func may be provided to calculate the + * static power consumed by these cpus. If the platform's static + * power consumption is unknown or negligible, make it NULL. + * + * Return: a valid struct thermal_cooling_device pointer on success, + * on failure, it returns a corresponding ERR_PTR(). + */ +struct thermal_cooling_device * +of_cpufreq_power_cooling_register(struct device_node *np, + const struct cpumask *clip_cpus, + u32 capacitance, + get_static_t plat_static_func) +{ + if (!np) + return ERR_PTR(-EINVAL); + + return __cpufreq_cooling_register(np, clip_cpus, capacitance, + plat_static_func); +} +EXPORT_SYMBOL(of_cpufreq_power_cooling_register); + +/** * cpufreq_cooling_unregister - function to remove cpufreq cooling device. * @cdev: thermal cooling device pointer. * @@ -475,6 +1024,8 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev) thermal_cooling_device_unregister(cpufreq_dev->cool_dev); release_idr(&cpufreq_idr, cpufreq_dev->id); + kfree(cpufreq_dev->time_in_idle_timestamp); + kfree(cpufreq_dev->time_in_idle); kfree(cpufreq_dev->freq_table); kfree(cpufreq_dev); } diff --git a/drivers/thermal/db8500_thermal.c b/drivers/thermal/db8500_thermal.c index 20adfbe27df1..2fb273c4baa9 100644 --- a/drivers/thermal/db8500_thermal.c +++ b/drivers/thermal/db8500_thermal.c @@ -76,7 +76,7 @@ static int db8500_cdev_bind(struct thermal_zone_device *thermal, upper = lower = i > max_state ? max_state : i; ret = thermal_zone_bind_cooling_device(thermal, i, cdev, - upper, lower); + upper, lower, THERMAL_WEIGHT_DEFAULT); dev_info(&cdev->device, "%s bind to %d: %d-%s\n", cdev->type, i, ret, ret ? "fail" : "succeed"); diff --git a/drivers/thermal/fair_share.c b/drivers/thermal/fair_share.c index 6e0a3fbfae86..8c50b8d6afb7 100644 --- a/drivers/thermal/fair_share.c +++ b/drivers/thermal/fair_share.c @@ -59,13 +59,13 @@ static int get_trip_level(struct thermal_zone_device *tz) } static long get_target_state(struct thermal_zone_device *tz, - struct thermal_cooling_device *cdev, int weight, int level) + struct thermal_cooling_device *cdev, int percentage, int level) { unsigned long max_state; cdev->ops->get_max_state(cdev, &max_state); - return (long)(weight * level * max_state) / (100 * tz->trips); + return (long)(percentage * level * max_state) / (100 * tz->trips); } /** @@ -77,7 +77,7 @@ static long get_target_state(struct thermal_zone_device *tz, * * Parameters used for Throttling: * P1. max_state: Maximum throttle state exposed by the cooling device. - * P2. weight[i]/100: + * P2. percentage[i]/100: * How 'effective' the 'i'th device is, in cooling the given zone. * P3. cur_trip_level/max_no_of_trips: * This describes the extent to which the devices should be throttled. @@ -88,28 +88,33 @@ static long get_target_state(struct thermal_zone_device *tz, */ static int fair_share_throttle(struct thermal_zone_device *tz, int trip) { - const struct thermal_zone_params *tzp; - struct thermal_cooling_device *cdev; struct thermal_instance *instance; - int i; + int total_weight = 0; + int total_instance = 0; int cur_trip_level = get_trip_level(tz); - if (!tz->tzp || !tz->tzp->tbp) - return -EINVAL; + list_for_each_entry(instance, &tz->thermal_instances, tz_node) { + if (instance->trip != trip) + continue; + + total_weight += instance->weight; + total_instance++; + } - tzp = tz->tzp; + list_for_each_entry(instance, &tz->thermal_instances, tz_node) { + int percentage; + struct thermal_cooling_device *cdev = instance->cdev; - for (i = 0; i < tzp->num_tbps; i++) { - if (!tzp->tbp[i].cdev) + if (instance->trip != trip) continue; - cdev = tzp->tbp[i].cdev; - instance = get_thermal_instance(tz, cdev, trip); - if (!instance) - continue; + if (!total_weight) + percentage = 100 / total_instance; + else + percentage = (instance->weight * 100) / total_weight; - instance->target = get_target_state(tz, cdev, - tzp->tbp[i].weight, cur_trip_level); + instance->target = get_target_state(tz, cdev, percentage, + cur_trip_level); instance->cdev->updated = false; thermal_cdev_update(cdev); diff --git a/drivers/thermal/imx_thermal.c b/drivers/thermal/imx_thermal.c index 2ccbc0788353..fde4c2876d14 100644 --- a/drivers/thermal/imx_thermal.c +++ b/drivers/thermal/imx_thermal.c @@ -306,7 +306,8 @@ static int imx_bind(struct thermal_zone_device *tz, ret = thermal_zone_bind_cooling_device(tz, IMX_TRIP_PASSIVE, cdev, THERMAL_NO_LIMIT, - THERMAL_NO_LIMIT); + THERMAL_NO_LIMIT, + THERMAL_WEIGHT_DEFAULT); if (ret) { dev_err(&tz->device, "binding zone %s with cdev %s failed:%d\n", diff --git a/drivers/thermal/of-thermal.c b/drivers/thermal/of-thermal.c index 668fb1bdea9e..b295b2b6c191 100644 --- a/drivers/thermal/of-thermal.c +++ b/drivers/thermal/of-thermal.c @@ -58,6 +58,8 @@ struct __thermal_bind_params { * @mode: current thermal zone device mode (enabled/disabled) * @passive_delay: polling interval while passive cooling is activated * @polling_delay: zone polling interval + * @slope: slope of the temperature adjustment curve + * @offset: offset of the temperature adjustment curve * @ntrips: number of trip points * @trips: an array of trip points (0..ntrips - 1) * @num_tbps: number of thermal bind params @@ -70,6 +72,8 @@ struct __thermal_zone { enum thermal_device_mode mode; int passive_delay; int polling_delay; + int slope; + int offset; /* trip data */ int ntrips; @@ -227,7 +231,8 @@ static int of_thermal_bind(struct thermal_zone_device *thermal, ret = thermal_zone_bind_cooling_device(thermal, tbp->trip_id, cdev, tbp->max, - tbp->min); + tbp->min, + tbp->usage); if (ret) return ret; } @@ -581,7 +586,7 @@ static int thermal_of_populate_bind_params(struct device_node *np, u32 prop; /* Default weight. Usage is optional */ - __tbp->usage = 0; + __tbp->usage = THERMAL_WEIGHT_DEFAULT; ret = of_property_read_u32(np, "contribution", &prop); if (ret == 0) __tbp->usage = prop; @@ -715,7 +720,7 @@ static int thermal_of_populate_trip(struct device_node *np, * @np parameter and fills the read data into a __thermal_zone data structure * and return this pointer. * - * TODO: Missing properties to parse: thermal-sensor-names and coefficients + * TODO: Missing properties to parse: thermal-sensor-names * * Return: On success returns a valid struct __thermal_zone, * otherwise, it returns a corresponding ERR_PTR(). Caller must @@ -727,7 +732,7 @@ thermal_of_build_thermal_zone(struct device_node *np) struct device_node *child = NULL, *gchild; struct __thermal_zone *tz; int ret, i; - u32 prop; + u32 prop, coef[2]; if (!np) { pr_err("no thermal zone np\n"); @@ -752,6 +757,20 @@ thermal_of_build_thermal_zone(struct device_node *np) } tz->polling_delay = prop; + /* + * REVIST: for now, the thermal framework supports only + * one sensor per thermal zone. Thus, we are considering + * only the first two values as slope and offset. + */ + ret = of_property_read_u32_array(np, "coefficients", coef, 2); + if (ret == 0) { + tz->slope = coef[0]; + tz->offset = coef[1]; + } else { + tz->slope = 1; + tz->offset = 0; + } + /* trips */ child = of_get_child_by_name(np, "trips"); @@ -865,6 +884,8 @@ int __init of_parse_thermal_zones(void) for_each_child_of_node(np, child) { struct thermal_zone_device *zone; struct thermal_zone_params *tzp; + int i, mask = 0; + u32 prop; /* Check whether child is enabled or not */ if (!of_device_is_available(child)) @@ -891,8 +912,18 @@ int __init of_parse_thermal_zones(void) /* No hwmon because there might be hwmon drivers registering */ tzp->no_hwmon = true; + if (!of_property_read_u32(child, "sustainable-power", &prop)) + tzp->sustainable_power = prop; + + for (i = 0; i < tz->ntrips; i++) + mask |= 1 << i; + + /* these two are left for temperature drivers to use */ + tzp->slope = tz->slope; + tzp->offset = tz->offset; + zone = thermal_zone_device_register(child->name, tz->ntrips, - 0, tz, + mask, tz, ops, tzp, tz->passive_delay, tz->polling_delay); diff --git a/drivers/thermal/power_allocator.c b/drivers/thermal/power_allocator.c new file mode 100644 index 000000000000..251676902869 --- /dev/null +++ b/drivers/thermal/power_allocator.c @@ -0,0 +1,544 @@ +/* + * A power allocator to manage temperature + * + * Copyright (C) 2014 ARM Ltd. + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed "as is" WITHOUT ANY WARRANTY of any + * kind, whether express or implied; without even the implied warranty + * of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + */ + +#define pr_fmt(fmt) "Power allocator: " fmt + +#include <linux/rculist.h> +#include <linux/slab.h> +#include <linux/thermal.h> + +#define CREATE_TRACE_POINTS +#include <trace/events/thermal_power_allocator.h> + +#include "thermal_core.h" + +#define FRAC_BITS 10 +#define int_to_frac(x) ((x) << FRAC_BITS) +#define frac_to_int(x) ((x) >> FRAC_BITS) + +/** + * mul_frac() - multiply two fixed-point numbers + * @x: first multiplicand + * @y: second multiplicand + * + * Return: the result of multiplying two fixed-point numbers. The + * result is also a fixed-point number. + */ +static inline s64 mul_frac(s64 x, s64 y) +{ + return (x * y) >> FRAC_BITS; +} + +/** + * div_frac() - divide two fixed-point numbers + * @x: the dividend + * @y: the divisor + * + * Return: the result of dividing two fixed-point numbers. The + * result is also a fixed-point number. + */ +static inline s64 div_frac(s64 x, s64 y) +{ + return div_s64(x << FRAC_BITS, y); +} + +/** + * struct power_allocator_params - parameters for the power allocator governor + * @err_integral: accumulated error in the PID controller. + * @prev_err: error in the previous iteration of the PID controller. + * Used to calculate the derivative term. + * @trip_switch_on: first passive trip point of the thermal zone. The + * governor switches on when this trip point is crossed. + * @trip_max_desired_temperature: last passive trip point of the thermal + * zone. The temperature we are + * controlling for. + */ +struct power_allocator_params { + s64 err_integral; + s32 prev_err; + int trip_switch_on; + int trip_max_desired_temperature; +}; + +/** + * pid_controller() - PID controller + * @tz: thermal zone we are operating in + * @current_temp: the current temperature in millicelsius + * @control_temp: the target temperature in millicelsius + * @max_allocatable_power: maximum allocatable power for this thermal zone + * + * This PID controller increases the available power budget so that the + * temperature of the thermal zone gets as close as possible to + * @control_temp and limits the power if it exceeds it. k_po is the + * proportional term when we are overshooting, k_pu is the + * proportional term when we are undershooting. integral_cutoff is a + * threshold below which we stop accumulating the error. The + * accumulated error is only valid if the requested power will make + * the system warmer. If the system is mostly idle, there's no point + * in accumulating positive error. + * + * Return: The power budget for the next period. + */ +static u32 pid_controller(struct thermal_zone_device *tz, + unsigned long current_temp, + unsigned long control_temp, + u32 max_allocatable_power) +{ + s64 p, i, d, power_range; + s32 err, max_power_frac; + struct power_allocator_params *params = tz->governor_data; + + max_power_frac = int_to_frac(max_allocatable_power); + + err = ((s32)control_temp - (s32)current_temp); + err = int_to_frac(err); + + /* Calculate the proportional term */ + p = mul_frac(err < 0 ? tz->tzp->k_po : tz->tzp->k_pu, err); + + /* + * Calculate the integral term + * + * if the error is less than cut off allow integration (but + * the integral is limited to max power) + */ + i = mul_frac(tz->tzp->k_i, params->err_integral); + + if (err < int_to_frac(tz->tzp->integral_cutoff)) { + s64 i_next = i + mul_frac(tz->tzp->k_i, err); + + if (abs64(i_next) < max_power_frac) { + i = i_next; + params->err_integral += err; + } + } + + /* + * Calculate the derivative term + * + * We do err - prev_err, so with a positive k_d, a decreasing + * error (i.e. driving closer to the line) results in less + * power being applied, slowing down the controller) + */ + d = mul_frac(tz->tzp->k_d, err - params->prev_err); + d = div_frac(d, tz->passive_delay); + params->prev_err = err; + + power_range = p + i + d; + + /* feed-forward the known sustainable dissipatable power */ + power_range = tz->tzp->sustainable_power + frac_to_int(power_range); + + power_range = clamp(power_range, (s64)0, (s64)max_allocatable_power); + + trace_thermal_power_allocator_pid(tz, frac_to_int(err), + frac_to_int(params->err_integral), + frac_to_int(p), frac_to_int(i), + frac_to_int(d), power_range); + + return power_range; +} + +/** + * divvy_up_power() - divvy the allocated power between the actors + * @req_power: each actor's requested power + * @max_power: each actor's maximum available power + * @num_actors: size of the @req_power, @max_power and @granted_power's array + * @total_req_power: sum of @req_power + * @power_range: total allocated power + * @granted_power: output array: each actor's granted power + * @extra_actor_power: an appropriately sized array to be used in the + * function as temporary storage of the extra power given + * to the actors + * + * This function divides the total allocated power (@power_range) + * fairly between the actors. It first tries to give each actor a + * share of the @power_range according to how much power it requested + * compared to the rest of the actors. For example, if only one actor + * requests power, then it receives all the @power_range. If + * three actors each requests 1mW, each receives a third of the + * @power_range. + * + * If any actor received more than their maximum power, then that + * surplus is re-divvied among the actors based on how far they are + * from their respective maximums. + * + * Granted power for each actor is written to @granted_power, which + * should've been allocated by the calling function. + */ +static void divvy_up_power(u32 *req_power, u32 *max_power, int num_actors, + u32 total_req_power, u32 power_range, + u32 *granted_power, u32 *extra_actor_power) +{ + u32 extra_power, capped_extra_power; + int i; + + /* + * Prevent division by 0 if none of the actors request power. + */ + if (!total_req_power) + total_req_power = 1; + + capped_extra_power = 0; + extra_power = 0; + for (i = 0; i < num_actors; i++) { + u64 req_range = req_power[i] * power_range; + + granted_power[i] = DIV_ROUND_CLOSEST_ULL(req_range, + total_req_power); + + if (granted_power[i] > max_power[i]) { + extra_power += granted_power[i] - max_power[i]; + granted_power[i] = max_power[i]; + } + + extra_actor_power[i] = max_power[i] - granted_power[i]; + capped_extra_power += extra_actor_power[i]; + } + + if (!extra_power) + return; + + /* + * Re-divvy the reclaimed extra among actors based on + * how far they are from the max + */ + extra_power = min(extra_power, capped_extra_power); + if (capped_extra_power > 0) + for (i = 0; i < num_actors; i++) + granted_power[i] += (extra_actor_power[i] * + extra_power) / capped_extra_power; +} + +static int allocate_power(struct thermal_zone_device *tz, + unsigned long current_temp, + unsigned long control_temp) +{ + struct thermal_instance *instance; + struct power_allocator_params *params = tz->governor_data; + u32 *req_power, *max_power, *granted_power, *extra_actor_power; + u32 *weighted_req_power; + u32 total_req_power, max_allocatable_power, total_weighted_req_power; + u32 total_granted_power, power_range; + int i, num_actors, total_weight, ret = 0; + int trip_max_desired_temperature = params->trip_max_desired_temperature; + + mutex_lock(&tz->lock); + + num_actors = 0; + total_weight = 0; + list_for_each_entry(instance, &tz->thermal_instances, tz_node) { + if ((instance->trip == trip_max_desired_temperature) && + cdev_is_power_actor(instance->cdev)) { + num_actors++; + total_weight += instance->weight; + } + } + + /* + * We need to allocate five arrays of the same size: + * req_power, max_power, granted_power, extra_actor_power and + * weighted_req_power. They are going to be needed until this + * function returns. Allocate them all in one go to simplify + * the allocation and deallocation logic. + */ + BUILD_BUG_ON(sizeof(*req_power) != sizeof(*max_power)); + BUILD_BUG_ON(sizeof(*req_power) != sizeof(*granted_power)); + BUILD_BUG_ON(sizeof(*req_power) != sizeof(*extra_actor_power)); + BUILD_BUG_ON(sizeof(*req_power) != sizeof(*weighted_req_power)); + req_power = kcalloc(num_actors * 5, sizeof(*req_power), GFP_KERNEL); + if (!req_power) { + ret = -ENOMEM; + goto unlock; + } + + max_power = &req_power[num_actors]; + granted_power = &req_power[2 * num_actors]; + extra_actor_power = &req_power[3 * num_actors]; + weighted_req_power = &req_power[4 * num_actors]; + + i = 0; + total_weighted_req_power = 0; + total_req_power = 0; + max_allocatable_power = 0; + + list_for_each_entry(instance, &tz->thermal_instances, tz_node) { + int weight; + struct thermal_cooling_device *cdev = instance->cdev; + + if (instance->trip != trip_max_desired_temperature) + continue; + + if (!cdev_is_power_actor(cdev)) + continue; + + if (cdev->ops->get_requested_power(cdev, tz, &req_power[i])) + continue; + + if (!total_weight) + weight = 1 << FRAC_BITS; + else + weight = instance->weight; + + weighted_req_power[i] = frac_to_int(weight * req_power[i]); + + if (power_actor_get_max_power(cdev, tz, &max_power[i])) + continue; + + total_req_power += req_power[i]; + max_allocatable_power += max_power[i]; + total_weighted_req_power += weighted_req_power[i]; + + i++; + } + + power_range = pid_controller(tz, current_temp, control_temp, + max_allocatable_power); + + divvy_up_power(weighted_req_power, max_power, num_actors, + total_weighted_req_power, power_range, granted_power, + extra_actor_power); + + total_granted_power = 0; + i = 0; + list_for_each_entry(instance, &tz->thermal_instances, tz_node) { + if (instance->trip != trip_max_desired_temperature) + continue; + + if (!cdev_is_power_actor(instance->cdev)) + continue; + + power_actor_set_power(instance->cdev, instance, + granted_power[i]); + total_granted_power += granted_power[i]; + + i++; + } + + trace_thermal_power_allocator(tz, req_power, total_req_power, + granted_power, total_granted_power, + num_actors, power_range, + max_allocatable_power, current_temp, + (s32)control_temp - (s32)current_temp); + + kfree(req_power); +unlock: + mutex_unlock(&tz->lock); + + return ret; +} + +static int get_governor_trips(struct thermal_zone_device *tz, + struct power_allocator_params *params) +{ + int i, ret, last_passive; + bool found_first_passive; + + found_first_passive = false; + last_passive = -1; + ret = -EINVAL; + + for (i = 0; i < tz->trips; i++) { + enum thermal_trip_type type; + + ret = tz->ops->get_trip_type(tz, i, &type); + if (ret) + return ret; + + if (!found_first_passive) { + if (type == THERMAL_TRIP_PASSIVE) { + params->trip_switch_on = i; + found_first_passive = true; + } + } else if (type == THERMAL_TRIP_PASSIVE) { + last_passive = i; + } else { + break; + } + } + + if (last_passive != -1) { + params->trip_max_desired_temperature = last_passive; + ret = 0; + } else { + ret = -EINVAL; + } + + return ret; +} + +static void reset_pid_controller(struct power_allocator_params *params) +{ + params->err_integral = 0; + params->prev_err = 0; +} + +static void allow_maximum_power(struct thermal_zone_device *tz) +{ + struct thermal_instance *instance; + struct power_allocator_params *params = tz->governor_data; + + list_for_each_entry(instance, &tz->thermal_instances, tz_node) { + if ((instance->trip != params->trip_max_desired_temperature) || + (!cdev_is_power_actor(instance->cdev))) + continue; + + instance->target = 0; + instance->cdev->updated = false; + thermal_cdev_update(instance->cdev); + } +} + +/** + * power_allocator_bind() - bind the power_allocator governor to a thermal zone + * @tz: thermal zone to bind it to + * + * Check that the thermal zone is valid for this governor, that is, it + * has two thermal trips. If so, initialize the PID controller + * parameters and bind it to the thermal zone. + * + * Return: 0 on success, -EINVAL if the trips were invalid or -ENOMEM + * if we ran out of memory. + */ +static int power_allocator_bind(struct thermal_zone_device *tz) +{ + int ret; + struct power_allocator_params *params; + unsigned long switch_on_temp, control_temp; + u32 temperature_threshold; + + if (!tz->tzp || !tz->tzp->sustainable_power) { + dev_err(&tz->device, + "power_allocator: missing sustainable_power\n"); + return -EINVAL; + } + + params = kzalloc(sizeof(*params), GFP_KERNEL); + if (!params) + return -ENOMEM; + + ret = get_governor_trips(tz, params); + if (ret) { + dev_err(&tz->device, + "thermal zone %s has wrong trip setup for power allocator\n", + tz->type); + goto free; + } + + ret = tz->ops->get_trip_temp(tz, params->trip_switch_on, + &switch_on_temp); + if (ret) + goto free; + + ret = tz->ops->get_trip_temp(tz, params->trip_max_desired_temperature, + &control_temp); + if (ret) + goto free; + + temperature_threshold = control_temp - switch_on_temp; + + tz->tzp->k_po = tz->tzp->k_po ?: + int_to_frac(tz->tzp->sustainable_power) / temperature_threshold; + tz->tzp->k_pu = tz->tzp->k_pu ?: + int_to_frac(2 * tz->tzp->sustainable_power) / + temperature_threshold; + tz->tzp->k_i = tz->tzp->k_i ?: int_to_frac(10) / 1000; + /* + * The default for k_d and integral_cutoff is 0, so we can + * leave them as they are. + */ + + reset_pid_controller(params); + + tz->governor_data = params; + + return 0; + +free: + kfree(params); + return ret; +} + +static void power_allocator_unbind(struct thermal_zone_device *tz) +{ + dev_dbg(&tz->device, "Unbinding from thermal zone %d\n", tz->id); + kfree(tz->governor_data); + tz->governor_data = NULL; +} + +static int power_allocator_throttle(struct thermal_zone_device *tz, int trip) +{ + int ret; + unsigned long switch_on_temp, control_temp, current_temp; + struct power_allocator_params *params = tz->governor_data; + + /* + * We get called for every trip point but we only need to do + * our calculations once + */ + if (trip != params->trip_max_desired_temperature) + return 0; + + ret = thermal_zone_get_temp(tz, ¤t_temp); + if (ret) { + dev_warn(&tz->device, "Failed to get temperature: %d\n", ret); + return ret; + } + + ret = tz->ops->get_trip_temp(tz, params->trip_switch_on, + &switch_on_temp); + if (ret) { + dev_warn(&tz->device, + "Failed to get switch on temperature: %d\n", ret); + return ret; + } + + if (current_temp < switch_on_temp) { + tz->passive = 0; + reset_pid_controller(params); + allow_maximum_power(tz); + return 0; + } + + tz->passive = 1; + + ret = tz->ops->get_trip_temp(tz, params->trip_max_desired_temperature, + &control_temp); + if (ret) { + dev_warn(&tz->device, + "Failed to get the maximum desired temperature: %d\n", + ret); + return ret; + } + + return allocate_power(tz, current_temp, control_temp); +} + +static struct thermal_governor thermal_gov_power_allocator = { + .name = "power_allocator", + .bind_to_tz = power_allocator_bind, + .unbind_from_tz = power_allocator_unbind, + .throttle = power_allocator_throttle, +}; + +int thermal_gov_power_allocator_register(void) +{ + return thermal_register_governor(&thermal_gov_power_allocator); +} + +void thermal_gov_power_allocator_unregister(void) +{ + thermal_unregister_governor(&thermal_gov_power_allocator); +} diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c index a3282bfb343d..d060ca95fd09 100644 --- a/drivers/thermal/thermal_core.c +++ b/drivers/thermal/thermal_core.c @@ -78,6 +78,58 @@ static struct thermal_governor *__find_governor(const char *name) return NULL; } +/** + * bind_previous_governor() - bind the previous governor of the thermal zone + * @tz: a valid pointer to a struct thermal_zone_device + * @failed_gov_name: the name of the governor that failed to register + * + * Register the previous governor of the thermal zone after a new + * governor has failed to be bound. + */ +static void bind_previous_governor(struct thermal_zone_device *tz, + const char *failed_gov_name) +{ + if (tz->governor && tz->governor->bind_to_tz) { + if (tz->governor->bind_to_tz(tz)) { + dev_err(&tz->device, + "governor %s failed to bind and the previous one (%s) failed to bind again, thermal zone %s has no governor\n", + failed_gov_name, tz->governor->name, tz->type); + tz->governor = NULL; + } + } +} + +/** + * thermal_set_governor() - Switch to another governor + * @tz: a valid pointer to a struct thermal_zone_device + * @new_gov: pointer to the new governor + * + * Change the governor of thermal zone @tz. + * + * Return: 0 on success, an error if the new governor's bind_to_tz() failed. + */ +static int thermal_set_governor(struct thermal_zone_device *tz, + struct thermal_governor *new_gov) +{ + int ret = 0; + + if (tz->governor && tz->governor->unbind_from_tz) + tz->governor->unbind_from_tz(tz); + + if (new_gov && new_gov->bind_to_tz) { + ret = new_gov->bind_to_tz(tz); + if (ret) { + bind_previous_governor(tz, new_gov->name); + + return ret; + } + } + + tz->governor = new_gov; + + return ret; +} + int thermal_register_governor(struct thermal_governor *governor) { int err; @@ -110,8 +162,15 @@ int thermal_register_governor(struct thermal_governor *governor) name = pos->tzp->governor_name; - if (!strncasecmp(name, governor->name, THERMAL_NAME_LENGTH)) - pos->governor = governor; + if (!strncasecmp(name, governor->name, THERMAL_NAME_LENGTH)) { + int ret; + + ret = thermal_set_governor(pos, governor); + if (ret) + dev_err(&pos->device, + "Failed to set governor %s for thermal zone %s: %d\n", + governor->name, pos->type, ret); + } } mutex_unlock(&thermal_list_lock); @@ -137,7 +196,7 @@ void thermal_unregister_governor(struct thermal_governor *governor) list_for_each_entry(pos, &thermal_tz_list, node) { if (!strncasecmp(pos->governor->name, governor->name, THERMAL_NAME_LENGTH)) - pos->governor = NULL; + thermal_set_governor(pos, NULL); } mutex_unlock(&thermal_list_lock); @@ -221,7 +280,8 @@ static void print_bind_err_msg(struct thermal_zone_device *tz, static void __bind(struct thermal_zone_device *tz, int mask, struct thermal_cooling_device *cdev, - unsigned long *limits) + unsigned long *limits, + unsigned int weight) { int i, ret; @@ -236,7 +296,8 @@ static void __bind(struct thermal_zone_device *tz, int mask, upper = limits[i * 2 + 1]; } ret = thermal_zone_bind_cooling_device(tz, i, cdev, - upper, lower); + upper, lower, + weight); if (ret) print_bind_err_msg(tz, cdev, ret); } @@ -283,7 +344,8 @@ static void bind_cdev(struct thermal_cooling_device *cdev) continue; tzp->tbp[i].cdev = cdev; __bind(pos, tzp->tbp[i].trip_mask, cdev, - tzp->tbp[i].binding_limits); + tzp->tbp[i].binding_limits, + tzp->tbp[i].weight); } } @@ -322,7 +384,8 @@ static void bind_tz(struct thermal_zone_device *tz) continue; tzp->tbp[i].cdev = pos; __bind(tz, tzp->tbp[i].trip_mask, pos, - tzp->tbp[i].binding_limits); + tzp->tbp[i].binding_limits, + tzp->tbp[i].weight); } } exit: @@ -733,7 +796,8 @@ passive_store(struct device *dev, struct device_attribute *attr, thermal_zone_bind_cooling_device(tz, THERMAL_TRIPS_NONE, cdev, THERMAL_NO_LIMIT, - THERMAL_NO_LIMIT); + THERMAL_NO_LIMIT, + THERMAL_WEIGHT_DEFAULT); } mutex_unlock(&thermal_list_lock); if (!tz->passive_delay) @@ -785,8 +849,9 @@ policy_store(struct device *dev, struct device_attribute *attr, if (!gov) goto exit; - tz->governor = gov; - ret = count; + ret = thermal_set_governor(tz, gov); + if (!ret) + ret = count; exit: mutex_unlock(&tz->lock); @@ -830,6 +895,158 @@ emul_temp_store(struct device *dev, struct device_attribute *attr, static DEVICE_ATTR(emul_temp, S_IWUSR, NULL, emul_temp_store); #endif/*CONFIG_THERMAL_EMULATION*/ +static ssize_t +sustainable_power_show(struct device *dev, struct device_attribute *devattr, + char *buf) +{ + struct thermal_zone_device *tz = to_thermal_zone(dev); + + if (tz->tzp) + return sprintf(buf, "%u\n", tz->tzp->sustainable_power); + else + return -EIO; +} + +static ssize_t +sustainable_power_store(struct device *dev, struct device_attribute *devattr, + const char *buf, size_t count) +{ + struct thermal_zone_device *tz = to_thermal_zone(dev); + u32 sustainable_power; + + if (!tz->tzp) + return -EIO; + + if (kstrtou32(buf, 10, &sustainable_power)) + return -EINVAL; + + tz->tzp->sustainable_power = sustainable_power; + + return count; +} +static DEVICE_ATTR(sustainable_power, S_IWUSR | S_IRUGO, sustainable_power_show, + sustainable_power_store); + +#define create_s32_tzp_attr(name) \ + static ssize_t \ + name##_show(struct device *dev, struct device_attribute *devattr, \ + char *buf) \ + { \ + struct thermal_zone_device *tz = to_thermal_zone(dev); \ + \ + if (tz->tzp) \ + return sprintf(buf, "%u\n", tz->tzp->name); \ + else \ + return -EIO; \ + } \ + \ + static ssize_t \ + name##_store(struct device *dev, struct device_attribute *devattr, \ + const char *buf, size_t count) \ + { \ + struct thermal_zone_device *tz = to_thermal_zone(dev); \ + s32 value; \ + \ + if (!tz->tzp) \ + return -EIO; \ + \ + if (kstrtos32(buf, 10, &value)) \ + return -EINVAL; \ + \ + tz->tzp->name = value; \ + \ + return count; \ + } \ + static DEVICE_ATTR(name, S_IWUSR | S_IRUGO, name##_show, name##_store) + +create_s32_tzp_attr(k_po); +create_s32_tzp_attr(k_pu); +create_s32_tzp_attr(k_i); +create_s32_tzp_attr(k_d); +create_s32_tzp_attr(integral_cutoff); +create_s32_tzp_attr(slope); +create_s32_tzp_attr(offset); +#undef create_s32_tzp_attr + +static struct device_attribute *dev_tzp_attrs[] = { + &dev_attr_sustainable_power, + &dev_attr_k_po, + &dev_attr_k_pu, + &dev_attr_k_i, + &dev_attr_k_d, + &dev_attr_integral_cutoff, + &dev_attr_slope, + &dev_attr_offset, +}; + +static int create_tzp_attrs(struct device *dev) +{ + int i; + + for (i = 0; i < ARRAY_SIZE(dev_tzp_attrs); i++) { + int ret; + struct device_attribute *dev_attr = dev_tzp_attrs[i]; + + ret = device_create_file(dev, dev_attr); + if (ret) + return ret; + } + + return 0; +} + +/** + * power_actor_get_max_power() - get the maximum power that a cdev can consume + * @cdev: pointer to &thermal_cooling_device + * @tz: a valid thermal zone device pointer + * @max_power: pointer in which to store the maximum power + * + * Calculate the maximum power consumption in milliwats that the + * cooling device can currently consume and store it in @max_power. + * + * Return: 0 on success, -EINVAL if @cdev doesn't support the + * power_actor API or -E* on other error. + */ +int power_actor_get_max_power(struct thermal_cooling_device *cdev, + struct thermal_zone_device *tz, u32 *max_power) +{ + if (!cdev_is_power_actor(cdev)) + return -EINVAL; + + return cdev->ops->state2power(cdev, tz, 0, max_power); +} + +/** + * power_actor_set_power() - limit the maximum power that a cooling device can consume + * @cdev: pointer to &thermal_cooling_device + * @instance: thermal instance to update + * @power: the power in milliwatts + * + * Set the cooling device to consume at most @power milliwatts. + * + * Return: 0 on success, -EINVAL if the cooling device does not + * implement the power actor API or -E* for other failures. + */ +int power_actor_set_power(struct thermal_cooling_device *cdev, + struct thermal_instance *instance, u32 power) +{ + unsigned long state; + int ret; + + if (!cdev_is_power_actor(cdev)) + return -EINVAL; + + ret = cdev->ops->power2state(cdev, instance->tz, power, &state); + if (ret) + return ret; + + instance->target = state; + cdev->updated = false; + thermal_cdev_update(cdev); + + return 0; +} + static DEVICE_ATTR(type, 0444, type_show, NULL); static DEVICE_ATTR(temp, 0444, temp_show, NULL); static DEVICE_ATTR(mode, 0644, mode_show, mode_store); @@ -937,6 +1154,34 @@ static const struct attribute_group *cooling_device_attr_groups[] = { NULL, }; +static ssize_t +thermal_cooling_device_weight_show(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct thermal_instance *instance; + + instance = container_of(attr, struct thermal_instance, weight_attr); + + return sprintf(buf, "%d\n", instance->weight); +} + +static ssize_t +thermal_cooling_device_weight_store(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct thermal_instance *instance; + int ret, weight; + + ret = kstrtoint(buf, 0, &weight); + if (ret) + return ret; + + instance = container_of(attr, struct thermal_instance, weight_attr); + instance->weight = weight; + + return count; +} /* Device management */ /** @@ -951,6 +1196,9 @@ static const struct attribute_group *cooling_device_attr_groups[] = { * @lower: the Minimum cooling state can be used for this trip point. * THERMAL_NO_LIMIT means no lower limit, * and the cooling device can be in cooling state 0. + * @weight: The weight of the cooling device to be bound to the + * thermal zone. Use THERMAL_WEIGHT_DEFAULT for the + * default value * * This interface function bind a thermal cooling device to the certain trip * point of a thermal zone device. @@ -961,7 +1209,8 @@ static const struct attribute_group *cooling_device_attr_groups[] = { int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz, int trip, struct thermal_cooling_device *cdev, - unsigned long upper, unsigned long lower) + unsigned long upper, unsigned long lower, + unsigned int weight) { struct thermal_instance *dev; struct thermal_instance *pos; @@ -1006,6 +1255,7 @@ int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz, dev->upper = upper; dev->lower = lower; dev->target = THERMAL_NO_TARGET; + dev->weight = weight; result = get_idr(&tz->idr, &tz->lock, &dev->id); if (result) @@ -1026,6 +1276,16 @@ int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz, if (result) goto remove_symbol_link; + sprintf(dev->weight_attr_name, "cdev%d_weight", dev->id); + sysfs_attr_init(&dev->weight_attr.attr); + dev->weight_attr.attr.name = dev->weight_attr_name; + dev->weight_attr.attr.mode = S_IWUSR | S_IRUGO; + dev->weight_attr.show = thermal_cooling_device_weight_show; + dev->weight_attr.store = thermal_cooling_device_weight_store; + result = device_create_file(&tz->device, &dev->weight_attr); + if (result) + goto remove_trip_file; + mutex_lock(&tz->lock); mutex_lock(&cdev->lock); list_for_each_entry(pos, &tz->thermal_instances, tz_node) @@ -1044,6 +1304,8 @@ int thermal_zone_bind_cooling_device(struct thermal_zone_device *tz, if (!result) return 0; + device_remove_file(&tz->device, &dev->weight_attr); +remove_trip_file: device_remove_file(&tz->device, &dev->attr); remove_symbol_link: sysfs_remove_link(&tz->device.kobj, dev->name); @@ -1405,7 +1667,8 @@ static int create_trip_attrs(struct thermal_zone_device *tz, int mask) tz->trip_temp_attrs[indx].name; tz->trip_temp_attrs[indx].attr.attr.mode = S_IRUGO; tz->trip_temp_attrs[indx].attr.show = trip_point_temp_show; - if (mask & (1 << indx)) { + if (IS_ENABLED(CONFIG_THERMAL_WRITABLE_TRIPS) && + mask & (1 << indx)) { tz->trip_temp_attrs[indx].attr.attr.mode |= S_IWUSR; tz->trip_temp_attrs[indx].attr.store = trip_point_temp_store; @@ -1482,7 +1745,7 @@ static void remove_trip_attrs(struct thermal_zone_device *tz) struct thermal_zone_device *thermal_zone_device_register(const char *type, int trips, int mask, void *devdata, struct thermal_zone_device_ops *ops, - const struct thermal_zone_params *tzp, + struct thermal_zone_params *tzp, int passive_delay, int polling_delay) { struct thermal_zone_device *tz; @@ -1490,6 +1753,7 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type, int result; int count; int passive = 0; + struct thermal_governor *governor; if (type && strlen(type) >= THERMAL_NAME_LENGTH) return ERR_PTR(-EINVAL); @@ -1578,13 +1842,24 @@ struct thermal_zone_device *thermal_zone_device_register(const char *type, if (result) goto unregister; + /* Add thermal zone params */ + result = create_tzp_attrs(&tz->device); + if (result) + goto unregister; + /* Update 'this' zone's governor information */ mutex_lock(&thermal_governor_lock); if (tz->tzp) - tz->governor = __find_governor(tz->tzp->governor_name); + governor = __find_governor(tz->tzp->governor_name); else - tz->governor = def_governor; + governor = def_governor; + + result = thermal_set_governor(tz, governor); + if (result) { + mutex_unlock(&thermal_governor_lock); + goto unregister; + } mutex_unlock(&thermal_governor_lock); @@ -1676,7 +1951,7 @@ void thermal_zone_device_unregister(struct thermal_zone_device *tz) device_remove_file(&tz->device, &dev_attr_mode); device_remove_file(&tz->device, &dev_attr_policy); remove_trip_attrs(tz); - tz->governor = NULL; + thermal_set_governor(tz, NULL); thermal_remove_hwmon_sysfs(tz); release_idr(&thermal_tz_idr, &thermal_idr_lock, tz->id); @@ -1832,7 +2107,11 @@ static int __init thermal_register_governors(void) if (result) return result; - return thermal_gov_user_space_register(); + result = thermal_gov_user_space_register(); + if (result) + return result; + + return thermal_gov_power_allocator_register(); } static void thermal_unregister_governors(void) @@ -1841,6 +2120,7 @@ static void thermal_unregister_governors(void) thermal_gov_fair_share_unregister(); thermal_gov_bang_bang_unregister(); thermal_gov_user_space_unregister(); + thermal_gov_power_allocator_unregister(); } static int thermal_pm_notify(struct notifier_block *nb, diff --git a/drivers/thermal/thermal_core.h b/drivers/thermal/thermal_core.h index dce86ee8e9d7..749d41abfbab 100644 --- a/drivers/thermal/thermal_core.h +++ b/drivers/thermal/thermal_core.h @@ -47,8 +47,11 @@ struct thermal_instance { unsigned long target; /* expected cooling state */ char attr_name[THERMAL_NAME_LENGTH]; struct device_attribute attr; + char weight_attr_name[THERMAL_NAME_LENGTH]; + struct device_attribute weight_attr; struct list_head tz_node; /* node in tz->thermal_instances */ struct list_head cdev_node; /* node in cdev->thermal_instances */ + unsigned int weight; /* The weight of the cooling device */ }; int thermal_register_governor(struct thermal_governor *); @@ -86,6 +89,14 @@ static inline int thermal_gov_user_space_register(void) { return 0; } static inline void thermal_gov_user_space_unregister(void) {} #endif /* CONFIG_THERMAL_GOV_USER_SPACE */ +#ifdef CONFIG_THERMAL_GOV_POWER_ALLOCATOR +int thermal_gov_power_allocator_register(void); +void thermal_gov_power_allocator_unregister(void); +#else +static inline int thermal_gov_power_allocator_register(void) { return 0; } +static inline void thermal_gov_power_allocator_unregister(void) {} +#endif /* CONFIG_THERMAL_GOV_POWER_ALLOCATOR */ + /* device tree support */ #ifdef CONFIG_THERMAL_OF int of_parse_thermal_zones(void); diff --git a/drivers/thermal/ti-soc-thermal/ti-thermal-common.c b/drivers/thermal/ti-soc-thermal/ti-thermal-common.c index a38c1756442a..cb45e729adb5 100644 --- a/drivers/thermal/ti-soc-thermal/ti-thermal-common.c +++ b/drivers/thermal/ti-soc-thermal/ti-thermal-common.c @@ -146,7 +146,8 @@ static int ti_thermal_bind(struct thermal_zone_device *thermal, return thermal_zone_bind_cooling_device(thermal, 0, cdev, /* bind with min and max states defined by cpu_cooling */ THERMAL_NO_LIMIT, - THERMAL_NO_LIMIT); + THERMAL_NO_LIMIT, + THERMAL_WEIGHT_DEFAULT); } /* Unbind callback functions for thermal zone */ diff --git a/fs/block_dev.c b/fs/block_dev.c index ccfd31f1df3a..b1aa9877acba 100644 --- a/fs/block_dev.c +++ b/fs/block_dev.c @@ -14,6 +14,7 @@ #include <linux/device_cgroup.h> #include <linux/highmem.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/module.h> #include <linux/blkpg.h> #include <linux/magic.h> @@ -546,7 +547,8 @@ static struct file_system_type bd_type = { .kill_sb = kill_anon_super, }; -static struct super_block *blockdev_superblock __read_mostly; +struct super_block *blockdev_superblock __read_mostly; +EXPORT_SYMBOL_GPL(blockdev_superblock); void __init bdev_cache_init(void) { @@ -687,11 +689,6 @@ static struct block_device *bd_acquire(struct inode *inode) return bdev; } -int sb_is_blkdev_sb(struct super_block *sb) -{ - return sb == blockdev_superblock; -} - /* Call when you free inode */ void bd_forget(struct inode *inode) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 99e8f60c7962..ff10d6e093f5 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3272,11 +3272,8 @@ static int write_dev_supers(struct btrfs_device *device, */ static void btrfs_end_empty_barrier(struct bio *bio, int err) { - if (err) { - if (err == -EOPNOTSUPP) - set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); + if (err) clear_bit(BIO_UPTODATE, &bio->bi_flags); - } if (bio->bi_private) complete(bio->bi_private); bio_put(bio); @@ -3304,11 +3301,7 @@ static int write_dev_flush(struct btrfs_device *device, int wait) wait_for_completion(&device->flush_wait); - if (bio_flagged(bio, BIO_EOPNOTSUPP)) { - printk_in_rcu("BTRFS: disabling barriers on dev %s\n", - rcu_str_deref(device->name)); - device->nobarriers = 1; - } else if (!bio_flagged(bio, BIO_UPTODATE)) { + if (!bio_flagged(bio, BIO_UPTODATE)) { ret = -EIO; btrfs_dev_stat_inc_and_print(device, BTRFS_DEV_STAT_FLUSH_ERRS); diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 885f533a34d9..32e0df2d0bd6 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2767,8 +2767,6 @@ static int __must_check submit_one_bio(int rw, struct bio *bio, else btrfsic_submit_bio(rw, bio); - if (bio_flagged(bio, BIO_EOPNOTSUPP)) - ret = -EOPNOTSUPP; bio_put(bio); return ret; } diff --git a/fs/buffer.c b/fs/buffer.c index c7a5602d01ee..32d33b7af15d 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -30,6 +30,7 @@ #include <linux/quotaops.h> #include <linux/highmem.h> #include <linux/export.h> +#include <linux/backing-dev.h> #include <linux/writeback.h> #include <linux/hash.h> #include <linux/suspend.h> @@ -44,6 +45,9 @@ #include <trace/events/block.h> static int fsync_buffers_list(spinlock_t *lock, struct list_head *list); +static int submit_bh_wbc(int rw, struct buffer_head *bh, + unsigned long bio_flags, + struct writeback_control *wbc); #define BH_ENTRY(list) list_entry((list), struct buffer_head, b_assoc_buffers) @@ -623,21 +627,22 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); * * If warn is true, then emit a warning if the page is not uptodate and has * not been truncated. + * + * The caller must hold mem_cgroup_begin_page_stat() lock. */ -static void __set_page_dirty(struct page *page, - struct address_space *mapping, int warn) +static void __set_page_dirty(struct page *page, struct address_space *mapping, + struct mem_cgroup *memcg, int warn) { unsigned long flags; spin_lock_irqsave(&mapping->tree_lock, flags); if (page->mapping) { /* Race with truncate? */ WARN_ON_ONCE(warn && !PageUptodate(page)); - account_page_dirtied(page, mapping); + account_page_dirtied(page, mapping, memcg); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } spin_unlock_irqrestore(&mapping->tree_lock, flags); - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } /* @@ -668,6 +673,7 @@ static void __set_page_dirty(struct page *page, int __set_page_dirty_buffers(struct page *page) { int newly_dirty; + struct mem_cgroup *memcg; struct address_space *mapping = page_mapping(page); if (unlikely(!mapping)) @@ -683,11 +689,22 @@ int __set_page_dirty_buffers(struct page *page) bh = bh->b_this_page; } while (bh != head); } + /* + * Use mem_group_begin_page_stat() to keep PageDirty synchronized with + * per-memcg dirty page counters. + */ + memcg = mem_cgroup_begin_page_stat(page); newly_dirty = !TestSetPageDirty(page); spin_unlock(&mapping->private_lock); if (newly_dirty) - __set_page_dirty(page, mapping, 1); + __set_page_dirty(page, mapping, memcg, 1); + + mem_cgroup_end_page_stat(memcg); + + if (newly_dirty) + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + return newly_dirty; } EXPORT_SYMBOL(__set_page_dirty_buffers); @@ -1158,11 +1175,18 @@ void mark_buffer_dirty(struct buffer_head *bh) if (!test_set_buffer_dirty(bh)) { struct page *page = bh->b_page; + struct address_space *mapping = NULL; + struct mem_cgroup *memcg; + + memcg = mem_cgroup_begin_page_stat(page); if (!TestSetPageDirty(page)) { - struct address_space *mapping = page_mapping(page); + mapping = page_mapping(page); if (mapping) - __set_page_dirty(page, mapping, 0); + __set_page_dirty(page, mapping, memcg, 0); } + mem_cgroup_end_page_stat(memcg); + if (mapping) + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } } EXPORT_SYMBOL(mark_buffer_dirty); @@ -1684,8 +1708,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page, struct buffer_head *bh, *head; unsigned int blocksize, bbits; int nr_underway = 0; - int write_op = (wbc->sync_mode == WB_SYNC_ALL ? - WRITE_SYNC : WRITE); + int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); head = create_page_buffers(page, inode, (1 << BH_Dirty)|(1 << BH_Uptodate)); @@ -1774,7 +1797,7 @@ static int __block_write_full_page(struct inode *inode, struct page *page, do { struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { - submit_bh(write_op, bh); + submit_bh_wbc(write_op, bh, 0, wbc); nr_underway++; } bh = next; @@ -1828,7 +1851,7 @@ recover: struct buffer_head *next = bh->b_this_page; if (buffer_async_write(bh)) { clear_buffer_dirty(bh); - submit_bh(write_op, bh); + submit_bh_wbc(write_op, bh, 0, wbc); nr_underway++; } bh = next; @@ -2938,10 +2961,6 @@ static void end_bio_bh_io_sync(struct bio *bio, int err) { struct buffer_head *bh = bio->bi_private; - if (err == -EOPNOTSUPP) { - set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); - } - if (unlikely (test_bit(BIO_QUIET,&bio->bi_flags))) set_bit(BH_Quiet, &bh->b_state); @@ -2997,7 +3016,8 @@ void guard_bio_eod(int rw, struct bio *bio) } } -int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) +static int submit_bh_wbc(int rw, struct buffer_head *bh, + unsigned long bio_flags, struct writeback_control *wbc) { struct bio *bio; int ret = 0; @@ -3020,6 +3040,11 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) */ bio = bio_alloc(GFP_NOIO, 1); + if (wbc) { + wbc_init_bio(wbc, bio); + wbc_account_io(wbc, bh->b_page, bh->b_size); + } + bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio->bi_bdev = bh->b_bdev; bio->bi_io_vec[0].bv_page = bh->b_page; @@ -3041,20 +3066,19 @@ int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) if (buffer_prio(bh)) rw |= REQ_PRIO; - bio_get(bio); submit_bio(rw, bio); - - if (bio_flagged(bio, BIO_EOPNOTSUPP)) - ret = -EOPNOTSUPP; - - bio_put(bio); return ret; } + +int _submit_bh(int rw, struct buffer_head *bh, unsigned long bio_flags) +{ + return submit_bh_wbc(rw, bh, bio_flags, NULL); +} EXPORT_SYMBOL_GPL(_submit_bh); int submit_bh(int rw, struct buffer_head *bh) { - return _submit_bh(rw, bh, 0); + return submit_bh_wbc(rw, bh, 0, NULL); } EXPORT_SYMBOL(submit_bh); @@ -3243,8 +3267,8 @@ int try_to_free_buffers(struct page *page) * to synchronise against __set_page_dirty_buffers and prevent the * dirty bit from being lost. */ - if (ret && TestClearPageDirty(page)) - account_page_cleaned(page, mapping); + if (ret) + cancel_dirty_page(page); spin_unlock(&mapping->private_lock); out: if (buffers_to_free) { diff --git a/fs/ext2/super.c b/fs/ext2/super.c index d0e746e96511..900e19cf9ef6 100644 --- a/fs/ext2/super.c +++ b/fs/ext2/super.c @@ -882,6 +882,7 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent) sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | ((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ? MS_POSIXACL : 0); + sb->s_iflags |= SB_I_CGROUPWB; if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV && (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) || diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 9a83f149ac85..882a87fe0499 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -191,7 +191,7 @@ typedef struct ext4_io_end { } ext4_io_end_t; struct ext4_io_submit { - int io_op; + struct writeback_control *io_wbc; struct bio *io_bio; ext4_io_end_t *io_end; sector_t io_next_block; diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 87ba10d1d3bc..97c0d60eb63d 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -39,6 +39,7 @@ #include <linux/slab.h> #include <asm/uaccess.h> #include <linux/fiemap.h> +#include <linux/backing-dev.h> #include "ext4_jbd2.h" #include "ext4_extents.h" #include "xattr.h" diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 41260489d3bc..5b1613a54307 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -26,6 +26,7 @@ #include <linux/log2.h> #include <linux/module.h> #include <linux/slab.h> +#include <linux/backing-dev.h> #include <trace/events/ext4.h> #ifdef CONFIG_EXT4_DEBUG diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 8082565c59a9..00cda09e4412 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -357,9 +357,10 @@ void ext4_io_submit(struct ext4_io_submit *io) struct bio *bio = io->io_bio; if (bio) { + int io_op = io->io_wbc->sync_mode == WB_SYNC_ALL ? + WRITE_SYNC : WRITE; bio_get(io->io_bio); - submit_bio(io->io_op, io->io_bio); - BUG_ON(bio_flagged(io->io_bio, BIO_EOPNOTSUPP)); + submit_bio(io_op, io->io_bio); bio_put(io->io_bio); } io->io_bio = NULL; @@ -368,7 +369,7 @@ void ext4_io_submit(struct ext4_io_submit *io) void ext4_io_submit_init(struct ext4_io_submit *io, struct writeback_control *wbc) { - io->io_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE); + io->io_wbc = wbc; io->io_bio = NULL; io->io_end = NULL; } @@ -382,6 +383,7 @@ static int io_submit_init_bio(struct ext4_io_submit *io, bio = bio_alloc(GFP_NOIO, min(nvecs, BIO_MAX_PAGES)); if (!bio) return -ENOMEM; + wbc_init_bio(io->io_wbc, bio); bio->bi_iter.bi_sector = bh->b_blocknr * (bh->b_size >> 9); bio->bi_bdev = bh->b_bdev; bio->bi_end_io = ext4_end_bio; @@ -410,6 +412,7 @@ submit_and_retry: ret = bio_add_page(io->io_bio, page, bh->b_size, bh_offset(bh)); if (ret != bh->b_size) goto submit_and_retry; + wbc_account_io(io->io_wbc, page, bh->b_size); io->io_next_block++; return 0; } diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 8a3b9f14d198..9d35b6ae7e5f 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -24,6 +24,7 @@ #include <linux/slab.h> #include <linux/init.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/parser.h> #include <linux/buffer_head.h> #include <linux/exportfs.h> @@ -3649,6 +3650,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent) } if (test_opt(sb, DELALLOC)) clear_opt(sb, DELALLOC); + } else { + sb->s_iflags |= SB_I_CGROUPWB; } sb->s_flags = (sb->s_flags & ~MS_POSIXACL) | diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c index 8ab0cf1930bd..d211602e0f86 100644 --- a/fs/f2fs/node.c +++ b/fs/f2fs/node.c @@ -53,7 +53,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type) PAGE_CACHE_SHIFT; res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 2); } else if (type == DIRTY_DENTS) { - if (sbi->sb->s_bdi->dirty_exceeded) + if (sbi->sb->s_bdi->wb.dirty_exceeded) return false; mem_size = get_pages(sbi, F2FS_DIRTY_DENTS); res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); @@ -70,7 +70,7 @@ bool available_free_memory(struct f2fs_sb_info *sbi, int type) sizeof(struct extent_node)) >> PAGE_CACHE_SHIFT; res = mem_size < ((avail_ram * nm_i->ram_thresh / 100) >> 1); } else { - if (sbi->sb->s_bdi->dirty_exceeded) + if (sbi->sb->s_bdi->wb.dirty_exceeded) return false; } return res; diff --git a/fs/f2fs/segment.h b/fs/f2fs/segment.h index 85d7fa7514b2..aba72f7a8ac4 100644 --- a/fs/f2fs/segment.h +++ b/fs/f2fs/segment.h @@ -9,6 +9,7 @@ * published by the Free Software Foundation. */ #include <linux/blkdev.h> +#include <linux/backing-dev.h> /* constant macro */ #define NULL_SEGNO ((unsigned int)(~0)) @@ -713,7 +714,7 @@ static inline unsigned int max_hw_blocks(struct f2fs_sb_info *sbi) */ static inline int nr_pages_to_skip(struct f2fs_sb_info *sbi, int type) { - if (sbi->sb->s_bdi->dirty_exceeded) + if (sbi->sb->s_bdi->wb.dirty_exceeded) return 0; if (type == DATA) diff --git a/fs/fat/file.c b/fs/fat/file.c index 442d50a0e33e..a08f1039909a 100644 --- a/fs/fat/file.c +++ b/fs/fat/file.c @@ -11,6 +11,7 @@ #include <linux/compat.h> #include <linux/mount.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/fsnotify.h> #include <linux/security.h> #include "fat.h" diff --git a/fs/fat/inode.c b/fs/fat/inode.c index c06774658345..509411dd3698 100644 --- a/fs/fat/inode.c +++ b/fs/fat/inode.c @@ -18,6 +18,7 @@ #include <linux/parser.h> #include <linux/uio.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <asm/unaligned.h> #include "fat.h" diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 32a8bbd7a9ad..518c6294bf6c 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -27,6 +27,7 @@ #include <linux/backing-dev.h> #include <linux/tracepoint.h> #include <linux/device.h> +#include <linux/memcontrol.h> #include "internal.h" /* @@ -34,6 +35,10 @@ */ #define MIN_WRITEBACK_PAGES (4096UL >> (PAGE_CACHE_SHIFT - 10)) +struct wb_completion { + atomic_t cnt; +}; + /* * Passed into wb_writeback(), essentially a subset of writeback_control */ @@ -47,13 +52,29 @@ struct wb_writeback_work { unsigned int range_cyclic:1; unsigned int for_background:1; unsigned int for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ + unsigned int auto_free:1; /* free on completion */ + unsigned int single_wait:1; + unsigned int single_done:1; enum wb_reason reason; /* why was writeback initiated? */ struct list_head list; /* pending work list */ - struct completion *done; /* set if the caller waits */ + struct wb_completion *done; /* set if the caller waits */ }; /* + * If one wants to wait for one or more wb_writeback_works, each work's + * ->done should be set to a wb_completion defined using the following + * macro. Once all work items are issued with wb_queue_work(), the caller + * can wait for the completion of all using wb_wait_for_completion(). Work + * items which are waited upon aren't freed automatically on completion. + */ +#define DEFINE_WB_COMPLETION_ONSTACK(cmpl) \ + struct wb_completion cmpl = { \ + .cnt = ATOMIC_INIT(1), \ + } + + +/* * If an inode is constantly having its pages dirtied, but then the * updates stop dirtytime_expire_interval seconds in the past, it's * possible for the worst case time between when an inode has its @@ -65,35 +86,6 @@ struct wb_writeback_work { */ unsigned int dirtytime_expire_interval = 12 * 60 * 60; -/** - * writeback_in_progress - determine whether there is writeback in progress - * @bdi: the device's backing_dev_info structure. - * - * Determine whether there is writeback waiting to be handled against a - * backing device. - */ -int writeback_in_progress(struct backing_dev_info *bdi) -{ - return test_bit(BDI_writeback_running, &bdi->state); -} -EXPORT_SYMBOL(writeback_in_progress); - -struct backing_dev_info *inode_to_bdi(struct inode *inode) -{ - struct super_block *sb; - - if (!inode) - return &noop_backing_dev_info; - - sb = inode->i_sb; -#ifdef CONFIG_BLOCK - if (sb_is_blkdev_sb(sb)) - return blk_get_backing_dev_info(I_BDEV(inode)); -#endif - return sb->s_bdi; -} -EXPORT_SYMBOL_GPL(inode_to_bdi); - static inline struct inode *wb_inode(struct list_head *head) { return list_entry(head, struct inode, i_wb_list); @@ -109,45 +101,831 @@ static inline struct inode *wb_inode(struct list_head *head) EXPORT_TRACEPOINT_SYMBOL_GPL(wbc_writepage); -static void bdi_wakeup_thread(struct backing_dev_info *bdi) +static bool wb_io_lists_populated(struct bdi_writeback *wb) +{ + if (wb_has_dirty_io(wb)) { + return false; + } else { + set_bit(WB_has_dirty_io, &wb->state); + WARN_ON_ONCE(!wb->avg_write_bandwidth); + atomic_long_add(wb->avg_write_bandwidth, + &wb->bdi->tot_write_bandwidth); + return true; + } +} + +static void wb_io_lists_depopulated(struct bdi_writeback *wb) { - spin_lock_bh(&bdi->wb_lock); - if (test_bit(BDI_registered, &bdi->state)) - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); - spin_unlock_bh(&bdi->wb_lock); + if (wb_has_dirty_io(wb) && list_empty(&wb->b_dirty) && + list_empty(&wb->b_io) && list_empty(&wb->b_more_io)) { + clear_bit(WB_has_dirty_io, &wb->state); + WARN_ON_ONCE(atomic_long_sub_return(wb->avg_write_bandwidth, + &wb->bdi->tot_write_bandwidth) < 0); + } } -static void bdi_queue_work(struct backing_dev_info *bdi, - struct wb_writeback_work *work) +/** + * inode_wb_list_move_locked - move an inode onto a bdi_writeback IO list + * @inode: inode to be moved + * @wb: target bdi_writeback + * @head: one of @wb->b_{dirty|io|more_io} + * + * Move @inode->i_wb_list to @list of @wb and set %WB_has_dirty_io. + * Returns %true if @inode is the first occupant of the !dirty_time IO + * lists; otherwise, %false. + */ +static bool inode_wb_list_move_locked(struct inode *inode, + struct bdi_writeback *wb, + struct list_head *head) { - trace_writeback_queue(bdi, work); + assert_spin_locked(&wb->list_lock); + + list_move(&inode->i_wb_list, head); - spin_lock_bh(&bdi->wb_lock); - if (!test_bit(BDI_registered, &bdi->state)) { - if (work->done) - complete(work->done); + /* dirty_time doesn't count as dirty_io until expiration */ + if (head != &wb->b_dirty_time) + return wb_io_lists_populated(wb); + + wb_io_lists_depopulated(wb); + return false; +} + +/** + * inode_wb_list_del_locked - remove an inode from its bdi_writeback IO list + * @inode: inode to be removed + * @wb: bdi_writeback @inode is being removed from + * + * Remove @inode which may be on one of @wb->b_{dirty|io|more_io} lists and + * clear %WB_has_dirty_io if all are empty afterwards. + */ +static void inode_wb_list_del_locked(struct inode *inode, + struct bdi_writeback *wb) +{ + assert_spin_locked(&wb->list_lock); + + list_del_init(&inode->i_wb_list); + wb_io_lists_depopulated(wb); +} + +static void wb_wakeup(struct bdi_writeback *wb) +{ + spin_lock_bh(&wb->work_lock); + if (test_bit(WB_registered, &wb->state)) + mod_delayed_work(bdi_wq, &wb->dwork, 0); + spin_unlock_bh(&wb->work_lock); +} + +static void wb_queue_work(struct bdi_writeback *wb, + struct wb_writeback_work *work) +{ + trace_writeback_queue(wb->bdi, work); + + spin_lock_bh(&wb->work_lock); + if (!test_bit(WB_registered, &wb->state)) { + if (work->single_wait) + work->single_done = 1; goto out_unlock; } - list_add_tail(&work->list, &bdi->work_list); - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); + if (work->done) + atomic_inc(&work->done->cnt); + list_add_tail(&work->list, &wb->work_list); + mod_delayed_work(bdi_wq, &wb->dwork, 0); out_unlock: - spin_unlock_bh(&bdi->wb_lock); + spin_unlock_bh(&wb->work_lock); +} + +/** + * wb_wait_for_completion - wait for completion of bdi_writeback_works + * @bdi: bdi work items were issued to + * @done: target wb_completion + * + * Wait for one or more work items issued to @bdi with their ->done field + * set to @done, which should have been defined with + * DEFINE_WB_COMPLETION_ONSTACK(). This function returns after all such + * work items are completed. Work items which are waited upon aren't freed + * automatically on completion. + */ +static void wb_wait_for_completion(struct backing_dev_info *bdi, + struct wb_completion *done) +{ + atomic_dec(&done->cnt); /* put down the initial count */ + wait_event(bdi->wb_waitq, !atomic_read(&done->cnt)); +} + +#ifdef CONFIG_CGROUP_WRITEBACK + +/* parameters for foreign inode detection, see wb_detach_inode() */ +#define WB_FRN_TIME_SHIFT 13 /* 1s = 2^13, upto 8 secs w/ 16bit */ +#define WB_FRN_TIME_AVG_SHIFT 3 /* avg = avg * 7/8 + new * 1/8 */ +#define WB_FRN_TIME_CUT_DIV 2 /* ignore rounds < avg / 2 */ +#define WB_FRN_TIME_PERIOD (2 * (1 << WB_FRN_TIME_SHIFT)) /* 2s */ + +#define WB_FRN_HIST_SLOTS 16 /* inode->i_wb_frn_history is 16bit */ +#define WB_FRN_HIST_UNIT (WB_FRN_TIME_PERIOD / WB_FRN_HIST_SLOTS) + /* each slot's duration is 2s / 16 */ +#define WB_FRN_HIST_THR_SLOTS (WB_FRN_HIST_SLOTS / 2) + /* if foreign slots >= 8, switch */ +#define WB_FRN_HIST_MAX_SLOTS (WB_FRN_HIST_THR_SLOTS / 2 + 1) + /* one round can affect upto 5 slots */ + +void __inode_attach_wb(struct inode *inode, struct page *page) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback *wb = NULL; + + if (inode_cgwb_enabled(inode)) { + struct cgroup_subsys_state *memcg_css; + + if (page) { + memcg_css = mem_cgroup_css_from_page(page); + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + } else { + /* must pin memcg_css, see wb_get_create() */ + memcg_css = task_get_css(current, memory_cgrp_id); + wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + css_put(memcg_css); + } + } + + if (!wb) + wb = &bdi->wb; + + /* + * There may be multiple instances of this function racing to + * update the same inode. Use cmpxchg() to tell the winner. + */ + if (unlikely(cmpxchg(&inode->i_wb, NULL, wb))) + wb_put(wb); +} + +/** + * locked_inode_to_wb_and_lock_list - determine a locked inode's wb and lock it + * @inode: inode of interest with i_lock held + * + * Returns @inode's wb with its list_lock held. @inode->i_lock must be + * held on entry and is released on return. The returned wb is guaranteed + * to stay @inode's associated wb until its list_lock is released. + */ +static struct bdi_writeback * +locked_inode_to_wb_and_lock_list(struct inode *inode) + __releases(&inode->i_lock) + __acquires(&wb->list_lock) +{ + while (true) { + struct bdi_writeback *wb = inode_to_wb(inode); + + /* + * inode_to_wb() association is protected by both + * @inode->i_lock and @wb->list_lock but list_lock nests + * outside i_lock. Drop i_lock and verify that the + * association hasn't changed after acquiring list_lock. + */ + wb_get(wb); + spin_unlock(&inode->i_lock); + spin_lock(&wb->list_lock); + wb_put(wb); /* not gonna deref it anymore */ + + /* i_wb may have changed inbetween, can't use inode_to_wb() */ + if (likely(wb == inode->i_wb)) + return wb; /* @inode already has ref */ + + spin_unlock(&wb->list_lock); + cpu_relax(); + spin_lock(&inode->i_lock); + } +} + +/** + * inode_to_wb_and_lock_list - determine an inode's wb and lock it + * @inode: inode of interest + * + * Same as locked_inode_to_wb_and_lock_list() but @inode->i_lock isn't held + * on entry. + */ +static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode) + __acquires(&wb->list_lock) +{ + spin_lock(&inode->i_lock); + return locked_inode_to_wb_and_lock_list(inode); +} + +struct inode_switch_wbs_context { + struct inode *inode; + struct bdi_writeback *new_wb; + + struct rcu_head rcu_head; + struct work_struct work; +}; + +static void inode_switch_wbs_work_fn(struct work_struct *work) +{ + struct inode_switch_wbs_context *isw = + container_of(work, struct inode_switch_wbs_context, work); + struct inode *inode = isw->inode; + struct address_space *mapping = inode->i_mapping; + struct bdi_writeback *old_wb = inode->i_wb; + struct bdi_writeback *new_wb = isw->new_wb; + struct radix_tree_iter iter; + bool switched = false; + void **slot; + + /* + * By the time control reaches here, RCU grace period has passed + * since I_WB_SWITCH assertion and all wb stat update transactions + * between unlocked_inode_to_wb_begin/end() are guaranteed to be + * synchronizing against mapping->tree_lock. + * + * Grabbing old_wb->list_lock, inode->i_lock and mapping->tree_lock + * gives us exclusion against all wb related operations on @inode + * including IO list manipulations and stat updates. + */ + if (old_wb < new_wb) { + spin_lock(&old_wb->list_lock); + spin_lock_nested(&new_wb->list_lock, SINGLE_DEPTH_NESTING); + } else { + spin_lock(&new_wb->list_lock); + spin_lock_nested(&old_wb->list_lock, SINGLE_DEPTH_NESTING); + } + spin_lock(&inode->i_lock); + spin_lock_irq(&mapping->tree_lock); + + /* + * Once I_FREEING is visible under i_lock, the eviction path owns + * the inode and we shouldn't modify ->i_wb_list. + */ + if (unlikely(inode->i_state & I_FREEING)) + goto skip_switch; + + /* + * Count and transfer stats. Note that PAGECACHE_TAG_DIRTY points + * to possibly dirty pages while PAGECACHE_TAG_WRITEBACK points to + * pages actually under underwriteback. + */ + radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0, + PAGECACHE_TAG_DIRTY) { + struct page *page = radix_tree_deref_slot_protected(slot, + &mapping->tree_lock); + if (likely(page) && PageDirty(page)) { + __dec_wb_stat(old_wb, WB_RECLAIMABLE); + __inc_wb_stat(new_wb, WB_RECLAIMABLE); + } + } + + radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, 0, + PAGECACHE_TAG_WRITEBACK) { + struct page *page = radix_tree_deref_slot_protected(slot, + &mapping->tree_lock); + if (likely(page)) { + WARN_ON_ONCE(!PageWriteback(page)); + __dec_wb_stat(old_wb, WB_WRITEBACK); + __inc_wb_stat(new_wb, WB_WRITEBACK); + } + } + + wb_get(new_wb); + + /* + * Transfer to @new_wb's IO list if necessary. The specific list + * @inode was on is ignored and the inode is put on ->b_dirty which + * is always correct including from ->b_dirty_time. The transfer + * preserves @inode->dirtied_when ordering. + */ + if (!list_empty(&inode->i_wb_list)) { + struct inode *pos; + + inode_wb_list_del_locked(inode, old_wb); + inode->i_wb = new_wb; + list_for_each_entry(pos, &new_wb->b_dirty, i_wb_list) + if (time_after_eq(inode->dirtied_when, + pos->dirtied_when)) + break; + inode_wb_list_move_locked(inode, new_wb, pos->i_wb_list.prev); + } else { + inode->i_wb = new_wb; + } + + /* ->i_wb_frn updates may race wbc_detach_inode() but doesn't matter */ + inode->i_wb_frn_winner = 0; + inode->i_wb_frn_avg_time = 0; + inode->i_wb_frn_history = 0; + switched = true; +skip_switch: + /* + * Paired with load_acquire in unlocked_inode_to_wb_begin() and + * ensures that the new wb is visible if they see !I_WB_SWITCH. + */ + smp_store_release(&inode->i_state, inode->i_state & ~I_WB_SWITCH); + + spin_unlock_irq(&mapping->tree_lock); + spin_unlock(&inode->i_lock); + spin_unlock(&new_wb->list_lock); + spin_unlock(&old_wb->list_lock); + + if (switched) { + wb_wakeup(new_wb); + wb_put(old_wb); + } + wb_put(new_wb); + + iput(inode); + kfree(isw); +} + +static void inode_switch_wbs_rcu_fn(struct rcu_head *rcu_head) +{ + struct inode_switch_wbs_context *isw = container_of(rcu_head, + struct inode_switch_wbs_context, rcu_head); + + /* needs to grab bh-unsafe locks, bounce to work item */ + INIT_WORK(&isw->work, inode_switch_wbs_work_fn); + schedule_work(&isw->work); +} + +/** + * inode_switch_wbs - change the wb association of an inode + * @inode: target inode + * @new_wb_id: ID of the new wb + * + * Switch @inode's wb association to the wb identified by @new_wb_id. The + * switching is performed asynchronously and may fail silently. + */ +static void inode_switch_wbs(struct inode *inode, int new_wb_id) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + struct cgroup_subsys_state *memcg_css; + struct inode_switch_wbs_context *isw; + + /* noop if seems to be already in progress */ + if (inode->i_state & I_WB_SWITCH) + return; + + isw = kzalloc(sizeof(*isw), GFP_ATOMIC); + if (!isw) + return; + + /* find and pin the new wb */ + rcu_read_lock(); + memcg_css = css_from_id(new_wb_id, &memory_cgrp_subsys); + if (memcg_css) + isw->new_wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC); + rcu_read_unlock(); + if (!isw->new_wb) + goto out_free; + + /* while holding I_WB_SWITCH, no one else can update the association */ + spin_lock(&inode->i_lock); + if (inode->i_state & (I_WB_SWITCH | I_FREEING) || + inode_to_wb(inode) == isw->new_wb) { + spin_unlock(&inode->i_lock); + goto out_free; + } + inode->i_state |= I_WB_SWITCH; + spin_unlock(&inode->i_lock); + + ihold(inode); + isw->inode = inode; + + /* + * In addition to synchronizing among switchers, I_WB_SWITCH tells + * the RCU protected stat update paths to grab the mapping's + * tree_lock so that stat transfer can synchronize against them. + * Let's continue after I_WB_SWITCH is guaranteed to be visible. + */ + call_rcu(&isw->rcu_head, inode_switch_wbs_rcu_fn); + return; + +out_free: + if (isw->new_wb) + wb_put(isw->new_wb); + kfree(isw); +} + +/** + * wbc_attach_and_unlock_inode - associate wbc with target inode and unlock it + * @wbc: writeback_control of interest + * @inode: target inode + * + * @inode is locked and about to be written back under the control of @wbc. + * Record @inode's writeback context into @wbc and unlock the i_lock. On + * writeback completion, wbc_detach_inode() should be called. This is used + * to track the cgroup writeback context. + */ +void wbc_attach_and_unlock_inode(struct writeback_control *wbc, + struct inode *inode) +{ + if (!inode_cgwb_enabled(inode)) { + spin_unlock(&inode->i_lock); + return; + } + + wbc->wb = inode_to_wb(inode); + wbc->inode = inode; + + wbc->wb_id = wbc->wb->memcg_css->id; + wbc->wb_lcand_id = inode->i_wb_frn_winner; + wbc->wb_tcand_id = 0; + wbc->wb_bytes = 0; + wbc->wb_lcand_bytes = 0; + wbc->wb_tcand_bytes = 0; + + wb_get(wbc->wb); + spin_unlock(&inode->i_lock); + + /* + * A dying wb indicates that the memcg-blkcg mapping has changed + * and a new wb is already serving the memcg. Switch immediately. + */ + if (unlikely(wb_dying(wbc->wb))) + inode_switch_wbs(inode, wbc->wb_id); +} + +/** + * wbc_detach_inode - disassociate wbc from inode and perform foreign detection + * @wbc: writeback_control of the just finished writeback + * + * To be called after a writeback attempt of an inode finishes and undoes + * wbc_attach_and_unlock_inode(). Can be called under any context. + * + * As concurrent write sharing of an inode is expected to be very rare and + * memcg only tracks page ownership on first-use basis severely confining + * the usefulness of such sharing, cgroup writeback tracks ownership + * per-inode. While the support for concurrent write sharing of an inode + * is deemed unnecessary, an inode being written to by different cgroups at + * different points in time is a lot more common, and, more importantly, + * charging only by first-use can too readily lead to grossly incorrect + * behaviors (single foreign page can lead to gigabytes of writeback to be + * incorrectly attributed). + * + * To resolve this issue, cgroup writeback detects the majority dirtier of + * an inode and transfers the ownership to it. To avoid unnnecessary + * oscillation, the detection mechanism keeps track of history and gives + * out the switch verdict only if the foreign usage pattern is stable over + * a certain amount of time and/or writeback attempts. + * + * On each writeback attempt, @wbc tries to detect the majority writer + * using Boyer-Moore majority vote algorithm. In addition to the byte + * count from the majority voting, it also counts the bytes written for the + * current wb and the last round's winner wb (max of last round's current + * wb, the winner from two rounds ago, and the last round's majority + * candidate). Keeping track of the historical winner helps the algorithm + * to semi-reliably detect the most active writer even when it's not the + * absolute majority. + * + * Once the winner of the round is determined, whether the winner is + * foreign or not and how much IO time the round consumed is recorded in + * inode->i_wb_frn_history. If the amount of recorded foreign IO time is + * over a certain threshold, the switch verdict is given. + */ +void wbc_detach_inode(struct writeback_control *wbc) +{ + struct bdi_writeback *wb = wbc->wb; + struct inode *inode = wbc->inode; + unsigned long avg_time, max_bytes, max_time; + u16 history; + int max_id; + + if (!wb) + return; + + history = inode->i_wb_frn_history; + avg_time = inode->i_wb_frn_avg_time; + + /* pick the winner of this round */ + if (wbc->wb_bytes >= wbc->wb_lcand_bytes && + wbc->wb_bytes >= wbc->wb_tcand_bytes) { + max_id = wbc->wb_id; + max_bytes = wbc->wb_bytes; + } else if (wbc->wb_lcand_bytes >= wbc->wb_tcand_bytes) { + max_id = wbc->wb_lcand_id; + max_bytes = wbc->wb_lcand_bytes; + } else { + max_id = wbc->wb_tcand_id; + max_bytes = wbc->wb_tcand_bytes; + } + + /* + * Calculate the amount of IO time the winner consumed and fold it + * into the running average kept per inode. If the consumed IO + * time is lower than avag / WB_FRN_TIME_CUT_DIV, ignore it for + * deciding whether to switch or not. This is to prevent one-off + * small dirtiers from skewing the verdict. + */ + max_time = DIV_ROUND_UP((max_bytes >> PAGE_SHIFT) << WB_FRN_TIME_SHIFT, + wb->avg_write_bandwidth); + if (avg_time) + avg_time += (max_time >> WB_FRN_TIME_AVG_SHIFT) - + (avg_time >> WB_FRN_TIME_AVG_SHIFT); + else + avg_time = max_time; /* immediate catch up on first run */ + + if (max_time >= avg_time / WB_FRN_TIME_CUT_DIV) { + int slots; + + /* + * The switch verdict is reached if foreign wb's consume + * more than a certain proportion of IO time in a + * WB_FRN_TIME_PERIOD. This is loosely tracked by 16 slot + * history mask where each bit represents one sixteenth of + * the period. Determine the number of slots to shift into + * history from @max_time. + */ + slots = min(DIV_ROUND_UP(max_time, WB_FRN_HIST_UNIT), + (unsigned long)WB_FRN_HIST_MAX_SLOTS); + history <<= slots; + if (wbc->wb_id != max_id) + history |= (1U << slots) - 1; + + /* + * Switch if the current wb isn't the consistent winner. + * If there are multiple closely competing dirtiers, the + * inode may switch across them repeatedly over time, which + * is okay. The main goal is avoiding keeping an inode on + * the wrong wb for an extended period of time. + */ + if (hweight32(history) > WB_FRN_HIST_THR_SLOTS) + inode_switch_wbs(inode, max_id); + } + + /* + * Multiple instances of this function may race to update the + * following fields but we don't mind occassional inaccuracies. + */ + inode->i_wb_frn_winner = max_id; + inode->i_wb_frn_avg_time = min(avg_time, (unsigned long)U16_MAX); + inode->i_wb_frn_history = history; + + wb_put(wbc->wb); + wbc->wb = NULL; +} + +/** + * wbc_account_io - account IO issued during writeback + * @wbc: writeback_control of the writeback in progress + * @page: page being written out + * @bytes: number of bytes being written out + * + * @bytes from @page are about to written out during the writeback + * controlled by @wbc. Keep the book for foreign inode detection. See + * wbc_detach_inode(). + */ +void wbc_account_io(struct writeback_control *wbc, struct page *page, + size_t bytes) +{ + int id; + + /* + * pageout() path doesn't attach @wbc to the inode being written + * out. This is intentional as we don't want the function to block + * behind a slow cgroup. Ultimately, we want pageout() to kick off + * regular writeback instead of writing things out itself. + */ + if (!wbc->wb) + return; + + rcu_read_lock(); + id = mem_cgroup_css_from_page(page)->id; + rcu_read_unlock(); + + if (id == wbc->wb_id) { + wbc->wb_bytes += bytes; + return; + } + + if (id == wbc->wb_lcand_id) + wbc->wb_lcand_bytes += bytes; + + /* Boyer-Moore majority vote algorithm */ + if (!wbc->wb_tcand_bytes) + wbc->wb_tcand_id = id; + if (id == wbc->wb_tcand_id) + wbc->wb_tcand_bytes += bytes; + else + wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes); } +EXPORT_SYMBOL_GPL(wbc_account_io); -static void -__bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, - bool range_cyclic, enum wb_reason reason) +/** + * inode_congested - test whether an inode is congested + * @inode: inode to test for congestion + * @cong_bits: mask of WB_[a]sync_congested bits to test + * + * Tests whether @inode is congested. @cong_bits is the mask of congestion + * bits to test and the return value is the mask of set bits. + * + * If cgroup writeback is enabled for @inode, the congestion state is + * determined by whether the cgwb (cgroup bdi_writeback) for the blkcg + * associated with @inode is congested; otherwise, the root wb's congestion + * state is used. + */ +int inode_congested(struct inode *inode, int cong_bits) +{ + /* + * Once set, ->i_wb never becomes NULL while the inode is alive. + * Start transaction iff ->i_wb is visible. + */ + if (inode && inode_to_wb_is_valid(inode)) { + struct bdi_writeback *wb; + bool locked, congested; + + wb = unlocked_inode_to_wb_begin(inode, &locked); + congested = wb_congested(wb, cong_bits); + unlocked_inode_to_wb_end(inode, locked); + return congested; + } + + return wb_congested(&inode_to_bdi(inode)->wb, cong_bits); +} +EXPORT_SYMBOL_GPL(inode_congested); + +/** + * wb_wait_for_single_work - wait for completion of a single bdi_writeback_work + * @bdi: bdi the work item was issued to + * @work: work item to wait for + * + * Wait for the completion of @work which was issued to one of @bdi's + * bdi_writeback's. The caller must have set @work->single_wait before + * issuing it. This wait operates independently fo + * wb_wait_for_completion() and also disables automatic freeing of @work. + */ +static void wb_wait_for_single_work(struct backing_dev_info *bdi, + struct wb_writeback_work *work) +{ + if (WARN_ON_ONCE(!work->single_wait)) + return; + + wait_event(bdi->wb_waitq, work->single_done); + + /* + * Paired with smp_wmb() in wb_do_writeback() and ensures that all + * modifications to @work prior to assertion of ->single_done is + * visible to the caller once this function returns. + */ + smp_rmb(); +} + +/** + * wb_split_bdi_pages - split nr_pages to write according to bandwidth + * @wb: target bdi_writeback to split @nr_pages to + * @nr_pages: number of pages to write for the whole bdi + * + * Split @wb's portion of @nr_pages according to @wb's write bandwidth in + * relation to the total write bandwidth of all wb's w/ dirty inodes on + * @wb->bdi. + */ +static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) +{ + unsigned long this_bw = wb->avg_write_bandwidth; + unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth); + + if (nr_pages == LONG_MAX) + return LONG_MAX; + + /* + * This may be called on clean wb's and proportional distribution + * may not make sense, just use the original @nr_pages in those + * cases. In general, we wanna err on the side of writing more. + */ + if (!tot_bw || this_bw >= tot_bw) + return nr_pages; + else + return DIV_ROUND_UP_ULL((u64)nr_pages * this_bw, tot_bw); +} + +/** + * wb_clone_and_queue_work - clone a wb_writeback_work and issue it to a wb + * @wb: target bdi_writeback + * @base_work: source wb_writeback_work + * + * Try to make a clone of @base_work and issue it to @wb. If cloning + * succeeds, %true is returned; otherwise, @base_work is issued directly + * and %false is returned. In the latter case, the caller is required to + * wait for @base_work's completion using wb_wait_for_single_work(). + * + * A clone is auto-freed on completion. @base_work never is. + */ +static bool wb_clone_and_queue_work(struct bdi_writeback *wb, + struct wb_writeback_work *base_work) { struct wb_writeback_work *work; + work = kmalloc(sizeof(*work), GFP_ATOMIC); + if (work) { + *work = *base_work; + work->auto_free = 1; + work->single_wait = 0; + } else { + work = base_work; + work->auto_free = 0; + work->single_wait = 1; + } + work->single_done = 0; + wb_queue_work(wb, work); + return work != base_work; +} + +/** + * bdi_split_work_to_wbs - split a wb_writeback_work to all wb's of a bdi + * @bdi: target backing_dev_info + * @base_work: wb_writeback_work to issue + * @skip_if_busy: skip wb's which already have writeback in progress + * + * Split and issue @base_work to all wb's (bdi_writeback's) of @bdi which + * have dirty inodes. If @base_work->nr_page isn't %LONG_MAX, it's + * distributed to the busy wbs according to each wb's proportion in the + * total active write bandwidth of @bdi. + */ +static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, + struct wb_writeback_work *base_work, + bool skip_if_busy) +{ + long nr_pages = base_work->nr_pages; + int next_blkcg_id = 0; + struct bdi_writeback *wb; + struct wb_iter iter; + + might_sleep(); + + if (!bdi_has_dirty_io(bdi)) + return; +restart: + rcu_read_lock(); + bdi_for_each_wb(wb, bdi, &iter, next_blkcg_id) { + if (!wb_has_dirty_io(wb) || + (skip_if_busy && writeback_in_progress(wb))) + continue; + + base_work->nr_pages = wb_split_bdi_pages(wb, nr_pages); + if (!wb_clone_and_queue_work(wb, base_work)) { + next_blkcg_id = wb->blkcg_css->id + 1; + rcu_read_unlock(); + wb_wait_for_single_work(bdi, base_work); + goto restart; + } + } + rcu_read_unlock(); +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static struct bdi_writeback * +locked_inode_to_wb_and_lock_list(struct inode *inode) + __releases(&inode->i_lock) + __acquires(&wb->list_lock) +{ + struct bdi_writeback *wb = inode_to_wb(inode); + + spin_unlock(&inode->i_lock); + spin_lock(&wb->list_lock); + return wb; +} + +static struct bdi_writeback *inode_to_wb_and_lock_list(struct inode *inode) + __acquires(&wb->list_lock) +{ + struct bdi_writeback *wb = inode_to_wb(inode); + + spin_lock(&wb->list_lock); + return wb; +} + +static long wb_split_bdi_pages(struct bdi_writeback *wb, long nr_pages) +{ + return nr_pages; +} + +static void bdi_split_work_to_wbs(struct backing_dev_info *bdi, + struct wb_writeback_work *base_work, + bool skip_if_busy) +{ + might_sleep(); + + if (bdi_has_dirty_io(bdi) && + (!skip_if_busy || !writeback_in_progress(&bdi->wb))) { + base_work->auto_free = 0; + base_work->single_wait = 0; + base_work->single_done = 0; + wb_queue_work(&bdi->wb, base_work); + } +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + +void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, + bool range_cyclic, enum wb_reason reason) +{ + struct wb_writeback_work *work; + + if (!wb_has_dirty_io(wb)) + return; + /* * This is WB_SYNC_NONE writeback, so if allocation fails just * wakeup the thread for old dirty data writeback */ work = kzalloc(sizeof(*work), GFP_ATOMIC); if (!work) { - trace_writeback_nowork(bdi); - bdi_wakeup_thread(bdi); + trace_writeback_nowork(wb->bdi); + wb_wakeup(wb); return; } @@ -155,46 +933,29 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, work->nr_pages = nr_pages; work->range_cyclic = range_cyclic; work->reason = reason; + work->auto_free = 1; - bdi_queue_work(bdi, work); + wb_queue_work(wb, work); } /** - * bdi_start_writeback - start writeback - * @bdi: the backing device to write from - * @nr_pages: the number of pages to write - * @reason: reason why some writeback work was initiated - * - * Description: - * This does WB_SYNC_NONE opportunistic writeback. The IO is only - * started when this function returns, we make no guarantees on - * completion. Caller need not hold sb s_umount semaphore. - * - */ -void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, - enum wb_reason reason) -{ - __bdi_start_writeback(bdi, nr_pages, true, reason); -} - -/** - * bdi_start_background_writeback - start background writeback - * @bdi: the backing device to write from + * wb_start_background_writeback - start background writeback + * @wb: bdi_writback to write from * * Description: * This makes sure WB_SYNC_NONE background writeback happens. When - * this function returns, it is only guaranteed that for given BDI + * this function returns, it is only guaranteed that for given wb * some IO is happening if we are over background dirty threshold. * Caller need not hold sb s_umount semaphore. */ -void bdi_start_background_writeback(struct backing_dev_info *bdi) +void wb_start_background_writeback(struct bdi_writeback *wb) { /* * We just wake up the flusher thread. It will perform background * writeback as soon as there is no other work to do. */ - trace_writeback_wake_background(bdi); - bdi_wakeup_thread(bdi); + trace_writeback_wake_background(wb->bdi); + wb_wakeup(wb); } /* @@ -202,11 +963,11 @@ void bdi_start_background_writeback(struct backing_dev_info *bdi) */ void inode_wb_list_del(struct inode *inode) { - struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback *wb; - spin_lock(&bdi->wb.list_lock); - list_del_init(&inode->i_wb_list); - spin_unlock(&bdi->wb.list_lock); + wb = inode_to_wb_and_lock_list(inode); + inode_wb_list_del_locked(inode, wb); + spin_unlock(&wb->list_lock); } /* @@ -220,7 +981,6 @@ void inode_wb_list_del(struct inode *inode) */ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb) { - assert_spin_locked(&wb->list_lock); if (!list_empty(&wb->b_dirty)) { struct inode *tail; @@ -228,7 +988,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb) if (time_before(inode->dirtied_when, tail->dirtied_when)) inode->dirtied_when = jiffies; } - list_move(&inode->i_wb_list, &wb->b_dirty); + inode_wb_list_move_locked(inode, wb, &wb->b_dirty); } /* @@ -236,8 +996,7 @@ static void redirty_tail(struct inode *inode, struct bdi_writeback *wb) */ static void requeue_io(struct inode *inode, struct bdi_writeback *wb) { - assert_spin_locked(&wb->list_lock); - list_move(&inode->i_wb_list, &wb->b_more_io); + inode_wb_list_move_locked(inode, wb, &wb->b_more_io); } static void inode_sync_complete(struct inode *inode) @@ -346,6 +1105,8 @@ static void queue_io(struct bdi_writeback *wb, struct wb_writeback_work *work) moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, 0, work); moved += move_expired_inodes(&wb->b_dirty_time, &wb->b_io, EXPIRE_DIRTY_ATIME, work); + if (moved) + wb_io_lists_populated(wb); trace_writeback_queue_io(wb, work, moved); } @@ -471,10 +1232,10 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb, redirty_tail(inode, wb); } else if (inode->i_state & I_DIRTY_TIME) { inode->dirtied_when = jiffies; - list_move(&inode->i_wb_list, &wb->b_dirty_time); + inode_wb_list_move_locked(inode, wb, &wb->b_dirty_time); } else { /* The inode is clean. Remove from writeback lists. */ - list_del_init(&inode->i_wb_list); + inode_wb_list_del_locked(inode, wb); } } @@ -605,10 +1366,11 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb, !mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))) goto out; inode->i_state |= I_SYNC; - spin_unlock(&inode->i_lock); + wbc_attach_and_unlock_inode(wbc, inode); ret = __writeback_single_inode(inode, wbc); + wbc_detach_inode(wbc); spin_lock(&wb->list_lock); spin_lock(&inode->i_lock); /* @@ -616,7 +1378,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb, * touch it. See comment above for explanation. */ if (!(inode->i_state & I_DIRTY_ALL)) - list_del_init(&inode->i_wb_list); + inode_wb_list_del_locked(inode, wb); spin_unlock(&wb->list_lock); inode_sync_complete(inode); out: @@ -624,7 +1386,7 @@ out: return ret; } -static long writeback_chunk_size(struct backing_dev_info *bdi, +static long writeback_chunk_size(struct bdi_writeback *wb, struct wb_writeback_work *work) { long pages; @@ -645,8 +1407,8 @@ static long writeback_chunk_size(struct backing_dev_info *bdi, if (work->sync_mode == WB_SYNC_ALL || work->tagged_writepages) pages = LONG_MAX; else { - pages = min(bdi->avg_write_bandwidth / 2, - global_dirty_limit / DIRTY_SCOPE); + pages = min(wb->avg_write_bandwidth / 2, + global_wb_domain.dirty_limit / DIRTY_SCOPE); pages = min(pages, work->nr_pages); pages = round_down(pages + MIN_WRITEBACK_PAGES, MIN_WRITEBACK_PAGES); @@ -741,9 +1503,9 @@ static long writeback_sb_inodes(struct super_block *sb, continue; } inode->i_state |= I_SYNC; - spin_unlock(&inode->i_lock); + wbc_attach_and_unlock_inode(&wbc, inode); - write_chunk = writeback_chunk_size(wb->bdi, work); + write_chunk = writeback_chunk_size(wb, work); wbc.nr_to_write = write_chunk; wbc.pages_skipped = 0; @@ -753,6 +1515,7 @@ static long writeback_sb_inodes(struct super_block *sb, */ __writeback_single_inode(inode, &wbc); + wbc_detach_inode(&wbc); work->nr_pages -= write_chunk - wbc.nr_to_write; wrote += write_chunk - wbc.nr_to_write; spin_lock(&wb->list_lock); @@ -830,33 +1593,6 @@ static long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages, return nr_pages - work.nr_pages; } -static bool over_bground_thresh(struct backing_dev_info *bdi) -{ - unsigned long background_thresh, dirty_thresh; - - global_dirty_limits(&background_thresh, &dirty_thresh); - - if (global_page_state(NR_FILE_DIRTY) + - global_page_state(NR_UNSTABLE_NFS) > background_thresh) - return true; - - if (bdi_stat(bdi, BDI_RECLAIMABLE) > - bdi_dirty_limit(bdi, background_thresh)) - return true; - - return false; -} - -/* - * Called under wb->list_lock. If there are multiple wb per bdi, - * only the flusher working on the first wb should do it. - */ -static void wb_update_bandwidth(struct bdi_writeback *wb, - unsigned long start_time) -{ - __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time); -} - /* * Explicit flushing or periodic writeback of "old" data. * @@ -899,14 +1635,14 @@ static long wb_writeback(struct bdi_writeback *wb, * after the other works are all done. */ if ((work->for_background || work->for_kupdate) && - !list_empty(&wb->bdi->work_list)) + !list_empty(&wb->work_list)) break; /* * For background writeout, stop when we are below the * background dirty threshold */ - if (work->for_background && !over_bground_thresh(wb->bdi)) + if (work->for_background && !wb_over_bg_thresh(wb)) break; /* @@ -970,18 +1706,17 @@ static long wb_writeback(struct bdi_writeback *wb, /* * Return the next wb_writeback_work struct that hasn't been processed yet. */ -static struct wb_writeback_work * -get_next_work_item(struct backing_dev_info *bdi) +static struct wb_writeback_work *get_next_work_item(struct bdi_writeback *wb) { struct wb_writeback_work *work = NULL; - spin_lock_bh(&bdi->wb_lock); - if (!list_empty(&bdi->work_list)) { - work = list_entry(bdi->work_list.next, + spin_lock_bh(&wb->work_lock); + if (!list_empty(&wb->work_list)) { + work = list_entry(wb->work_list.next, struct wb_writeback_work, list); list_del_init(&work->list); } - spin_unlock_bh(&bdi->wb_lock); + spin_unlock_bh(&wb->work_lock); return work; } @@ -998,7 +1733,7 @@ static unsigned long get_nr_dirty_pages(void) static long wb_check_background_flush(struct bdi_writeback *wb) { - if (over_bground_thresh(wb->bdi)) { + if (wb_over_bg_thresh(wb)) { struct wb_writeback_work work = { .nr_pages = LONG_MAX, @@ -1053,25 +1788,33 @@ static long wb_check_old_data_flush(struct bdi_writeback *wb) */ static long wb_do_writeback(struct bdi_writeback *wb) { - struct backing_dev_info *bdi = wb->bdi; struct wb_writeback_work *work; long wrote = 0; - set_bit(BDI_writeback_running, &wb->bdi->state); - while ((work = get_next_work_item(bdi)) != NULL) { + set_bit(WB_writeback_running, &wb->state); + while ((work = get_next_work_item(wb)) != NULL) { + struct wb_completion *done = work->done; + bool need_wake_up = false; - trace_writeback_exec(bdi, work); + trace_writeback_exec(wb->bdi, work); wrote += wb_writeback(wb, work); - /* - * Notify the caller of completion if this is a synchronous - * work item, otherwise just free it. - */ - if (work->done) - complete(work->done); - else + if (work->single_wait) { + WARN_ON_ONCE(work->auto_free); + /* paired w/ rmb in wb_wait_for_single_work() */ + smp_wmb(); + work->single_done = 1; + need_wake_up = true; + } else if (work->auto_free) { kfree(work); + } + + if (done && atomic_dec_and_test(&done->cnt)) + need_wake_up = true; + + if (need_wake_up) + wake_up_all(&wb->bdi->wb_waitq); } /* @@ -1079,7 +1822,7 @@ static long wb_do_writeback(struct bdi_writeback *wb) */ wrote += wb_check_old_data_flush(wb); wrote += wb_check_background_flush(wb); - clear_bit(BDI_writeback_running, &wb->bdi->state); + clear_bit(WB_writeback_running, &wb->state); return wrote; } @@ -1088,43 +1831,42 @@ static long wb_do_writeback(struct bdi_writeback *wb) * Handle writeback of dirty data for the device backed by this bdi. Also * reschedules periodically and does kupdated style flushing. */ -void bdi_writeback_workfn(struct work_struct *work) +void wb_workfn(struct work_struct *work) { struct bdi_writeback *wb = container_of(to_delayed_work(work), struct bdi_writeback, dwork); - struct backing_dev_info *bdi = wb->bdi; long pages_written; - set_worker_desc("flush-%s", dev_name(bdi->dev)); + set_worker_desc("flush-%s", dev_name(wb->bdi->dev)); current->flags |= PF_SWAPWRITE; if (likely(!current_is_workqueue_rescuer() || - !test_bit(BDI_registered, &bdi->state))) { + !test_bit(WB_registered, &wb->state))) { /* - * The normal path. Keep writing back @bdi until its + * The normal path. Keep writing back @wb until its * work_list is empty. Note that this path is also taken - * if @bdi is shutting down even when we're running off the + * if @wb is shutting down even when we're running off the * rescuer as work_list needs to be drained. */ do { pages_written = wb_do_writeback(wb); trace_writeback_pages_written(pages_written); - } while (!list_empty(&bdi->work_list)); + } while (!list_empty(&wb->work_list)); } else { /* * bdi_wq can't get enough workers and we're running off * the emergency worker. Don't hog it. Hopefully, 1024 is * enough for efficient IO. */ - pages_written = writeback_inodes_wb(&bdi->wb, 1024, + pages_written = writeback_inodes_wb(wb, 1024, WB_REASON_FORKER_THREAD); trace_writeback_pages_written(pages_written); } - if (!list_empty(&bdi->work_list)) + if (!list_empty(&wb->work_list)) mod_delayed_work(bdi_wq, &wb->dwork, 0); else if (wb_has_dirty_io(wb) && dirty_writeback_interval) - bdi_wakeup_thread_delayed(bdi); + wb_wakeup_delayed(wb); current->flags &= ~PF_SWAPWRITE; } @@ -1142,9 +1884,15 @@ void wakeup_flusher_threads(long nr_pages, enum wb_reason reason) rcu_read_lock(); list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) { + struct bdi_writeback *wb; + struct wb_iter iter; + if (!bdi_has_dirty_io(bdi)) continue; - __bdi_start_writeback(bdi, nr_pages, false, reason); + + bdi_for_each_wb(wb, bdi, &iter, 0) + wb_start_writeback(wb, wb_split_bdi_pages(wb, nr_pages), + false, reason); } rcu_read_unlock(); } @@ -1173,9 +1921,12 @@ static void wakeup_dirtytime_writeback(struct work_struct *w) rcu_read_lock(); list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) { - if (list_empty(&bdi->wb.b_dirty_time)) - continue; - bdi_wakeup_thread(bdi); + struct bdi_writeback *wb; + struct wb_iter iter; + + bdi_for_each_wb(wb, bdi, &iter, 0) + if (!list_empty(&bdi->wb.b_dirty_time)) + wb_wakeup(&bdi->wb); } rcu_read_unlock(); schedule_delayed_work(&dirtytime_work, dirtytime_expire_interval * HZ); @@ -1249,7 +2000,6 @@ static noinline void block_dump___mark_inode_dirty(struct inode *inode) void __mark_inode_dirty(struct inode *inode, int flags) { struct super_block *sb = inode->i_sb; - struct backing_dev_info *bdi = NULL; int dirtytime; trace_writeback_mark_inode_dirty(inode, flags); @@ -1289,6 +2039,8 @@ void __mark_inode_dirty(struct inode *inode, int flags) if ((inode->i_state & flags) != flags) { const int was_dirty = inode->i_state & I_DIRTY; + inode_attach_wb(inode, NULL); + if (flags & I_DIRTY_INODE) inode->i_state &= ~I_DIRTY_TIME; inode->i_state |= flags; @@ -1317,38 +2069,39 @@ void __mark_inode_dirty(struct inode *inode, int flags) * reposition it (that would break b_dirty time-ordering). */ if (!was_dirty) { + struct bdi_writeback *wb; + struct list_head *dirty_list; bool wakeup_bdi = false; - bdi = inode_to_bdi(inode); - spin_unlock(&inode->i_lock); - spin_lock(&bdi->wb.list_lock); - if (bdi_cap_writeback_dirty(bdi)) { - WARN(!test_bit(BDI_registered, &bdi->state), - "bdi-%s not registered\n", bdi->name); + wb = locked_inode_to_wb_and_lock_list(inode); - /* - * If this is the first dirty inode for this - * bdi, we have to wake-up the corresponding - * bdi thread to make sure background - * write-back happens later. - */ - if (!wb_has_dirty_io(&bdi->wb)) - wakeup_bdi = true; - } + WARN(bdi_cap_writeback_dirty(wb->bdi) && + !test_bit(WB_registered, &wb->state), + "bdi-%s not registered\n", wb->bdi->name); inode->dirtied_when = jiffies; if (dirtytime) inode->dirtied_time_when = jiffies; + if (inode->i_state & (I_DIRTY_INODE | I_DIRTY_PAGES)) - list_move(&inode->i_wb_list, &bdi->wb.b_dirty); + dirty_list = &wb->b_dirty; else - list_move(&inode->i_wb_list, - &bdi->wb.b_dirty_time); - spin_unlock(&bdi->wb.list_lock); + dirty_list = &wb->b_dirty_time; + + wakeup_bdi = inode_wb_list_move_locked(inode, wb, + dirty_list); + + spin_unlock(&wb->list_lock); trace_writeback_dirty_inode_enqueue(inode); - if (wakeup_bdi) - bdi_wakeup_thread_delayed(bdi); + /* + * If this is the first dirty inode for this bdi, + * we have to wake-up the corresponding bdi thread + * to make sure background write-back happens + * later. + */ + if (bdi_cap_writeback_dirty(wb->bdi) && wakeup_bdi) + wb_wakeup_delayed(wb); return; } } @@ -1411,6 +2164,28 @@ static void wait_sb_inodes(struct super_block *sb) iput(old_inode); } +static void __writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, + enum wb_reason reason, bool skip_if_busy) +{ + DEFINE_WB_COMPLETION_ONSTACK(done); + struct wb_writeback_work work = { + .sb = sb, + .sync_mode = WB_SYNC_NONE, + .tagged_writepages = 1, + .done = &done, + .nr_pages = nr, + .reason = reason, + }; + struct backing_dev_info *bdi = sb->s_bdi; + + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) + return; + WARN_ON(!rwsem_is_locked(&sb->s_umount)); + + bdi_split_work_to_wbs(sb->s_bdi, &work, skip_if_busy); + wb_wait_for_completion(bdi, &done); +} + /** * writeback_inodes_sb_nr - writeback dirty inodes from given super_block * @sb: the superblock @@ -1425,21 +2200,7 @@ void writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, enum wb_reason reason) { - DECLARE_COMPLETION_ONSTACK(done); - struct wb_writeback_work work = { - .sb = sb, - .sync_mode = WB_SYNC_NONE, - .tagged_writepages = 1, - .done = &done, - .nr_pages = nr, - .reason = reason, - }; - - if (sb->s_bdi == &noop_backing_dev_info) - return; - WARN_ON(!rwsem_is_locked(&sb->s_umount)); - bdi_queue_work(sb->s_bdi, &work); - wait_for_completion(&done); + __writeback_inodes_sb_nr(sb, nr, reason, false); } EXPORT_SYMBOL(writeback_inodes_sb_nr); @@ -1467,19 +2228,15 @@ EXPORT_SYMBOL(writeback_inodes_sb); * Invoke writeback_inodes_sb_nr if no writeback is currently underway. * Returns 1 if writeback was started, 0 if not. */ -int try_to_writeback_inodes_sb_nr(struct super_block *sb, - unsigned long nr, - enum wb_reason reason) +bool try_to_writeback_inodes_sb_nr(struct super_block *sb, unsigned long nr, + enum wb_reason reason) { - if (writeback_in_progress(sb->s_bdi)) - return 1; - if (!down_read_trylock(&sb->s_umount)) - return 0; + return false; - writeback_inodes_sb_nr(sb, nr, reason); + __writeback_inodes_sb_nr(sb, nr, reason, true); up_read(&sb->s_umount); - return 1; + return true; } EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr); @@ -1491,7 +2248,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb_nr); * Implement by try_to_writeback_inodes_sb_nr() * Returns 1 if writeback was started, 0 if not. */ -int try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) +bool try_to_writeback_inodes_sb(struct super_block *sb, enum wb_reason reason) { return try_to_writeback_inodes_sb_nr(sb, get_nr_dirty_pages(), reason); } @@ -1506,7 +2263,7 @@ EXPORT_SYMBOL(try_to_writeback_inodes_sb); */ void sync_inodes_sb(struct super_block *sb) { - DECLARE_COMPLETION_ONSTACK(done); + DEFINE_WB_COMPLETION_ONSTACK(done); struct wb_writeback_work work = { .sb = sb, .sync_mode = WB_SYNC_ALL, @@ -1516,14 +2273,15 @@ void sync_inodes_sb(struct super_block *sb) .reason = WB_REASON_SYNC, .for_sync = 1, }; + struct backing_dev_info *bdi = sb->s_bdi; /* Nothing to do? */ - if (sb->s_bdi == &noop_backing_dev_info) + if (!bdi_has_dirty_io(bdi) || bdi == &noop_backing_dev_info) return; WARN_ON(!rwsem_is_locked(&sb->s_umount)); - bdi_queue_work(sb->s_bdi, &work); - wait_for_completion(&done); + bdi_split_work_to_wbs(bdi, &work, false); + wb_wait_for_completion(bdi, &done); wait_sb_inodes(sb); } diff --git a/fs/fuse/file.c b/fs/fuse/file.c index 5ef05b5c4cff..8c5e2fa68835 100644 --- a/fs/fuse/file.c +++ b/fs/fuse/file.c @@ -1445,9 +1445,9 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req) list_del(&req->writepages_entry); for (i = 0; i < req->num_pages; i++) { - dec_bdi_stat(bdi, BDI_WRITEBACK); + dec_wb_stat(&bdi->wb, WB_WRITEBACK); dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP); - bdi_writeout_inc(bdi); + wb_writeout_inc(&bdi->wb); } wake_up(&fi->page_waitq); } @@ -1634,7 +1634,7 @@ static int fuse_writepage_locked(struct page *page) req->end = fuse_writepage_end; req->inode = inode; - inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK); + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); spin_lock(&fc->lock); @@ -1749,9 +1749,9 @@ static bool fuse_writepage_in_flight(struct fuse_req *new_req, copy_highpage(old_req->pages[0], page); spin_unlock(&fc->lock); - dec_bdi_stat(bdi, BDI_WRITEBACK); + dec_wb_stat(&bdi->wb, WB_WRITEBACK); dec_zone_page_state(page, NR_WRITEBACK_TEMP); - bdi_writeout_inc(bdi); + wb_writeout_inc(&bdi->wb); fuse_writepage_free(fc, new_req); fuse_request_free(new_req); goto out; @@ -1848,7 +1848,7 @@ static int fuse_writepages_fill(struct page *page, req->page_descs[req->num_pages].offset = 0; req->page_descs[req->num_pages].length = PAGE_SIZE; - inc_bdi_stat(inode_to_bdi(inode), BDI_WRITEBACK); + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK); inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP); err = 0; diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c index c18b49dc5d4f..894fb01a91da 100644 --- a/fs/gfs2/super.c +++ b/fs/gfs2/super.c @@ -748,7 +748,7 @@ static int gfs2_write_inode(struct inode *inode, struct writeback_control *wbc) if (wbc->sync_mode == WB_SYNC_ALL) gfs2_log_flush(GFS2_SB(inode), ip->i_gl, NORMAL_FLUSH); - if (bdi->dirty_exceeded) + if (bdi->wb.dirty_exceeded) gfs2_ail1_flush(sdp, wbc); else filemap_fdatawrite(metamapping); diff --git a/fs/hfs/super.c b/fs/hfs/super.c index 410b65eea683..4574fdd3d421 100644 --- a/fs/hfs/super.c +++ b/fs/hfs/super.c @@ -14,6 +14,7 @@ #include <linux/module.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/mount.h> #include <linux/init.h> #include <linux/nls.h> diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c index 593af2fdcc2d..7302d96ae8bf 100644 --- a/fs/hfsplus/super.c +++ b/fs/hfsplus/super.c @@ -11,6 +11,7 @@ #include <linux/init.h> #include <linux/pagemap.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/fs.h> #include <linux/slab.h> #include <linux/vfs.h> diff --git a/fs/inode.c b/fs/inode.c index 6e342cadef81..a049dc467c1a 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -223,6 +223,7 @@ EXPORT_SYMBOL(free_inode_nonrcu); void __destroy_inode(struct inode *inode) { BUG_ON(inode_has_buffers(inode)); + inode_detach_wb(inode); security_inode_free(inode); fsnotify_inode_delete(inode); locks_free_lock_context(inode->i_flctx); diff --git a/fs/mpage.c b/fs/mpage.c index 3e79220babac..ca0244b69de8 100644 --- a/fs/mpage.c +++ b/fs/mpage.c @@ -605,6 +605,8 @@ alloc_new: bio_get_nr_vecs(bdev), GFP_NOFS|__GFP_HIGH); if (bio == NULL) goto confused; + + wbc_init_bio(wbc, bio); } /* @@ -612,6 +614,7 @@ alloc_new: * the confused fail path above (OOM) will be very confused when * it finds all bh marked clean (i.e. it will not write anything) */ + wbc_account_io(wbc, page, PAGE_SIZE); length = first_unmapped << blkbits; if (bio_add_page(bio, page, length, 0) < length) { bio = mpage_bio_submit(WRITE, bio); diff --git a/fs/nfs/filelayout/filelayout.c b/fs/nfs/filelayout/filelayout.c index fb1fb2774d34..02ec07973bc4 100644 --- a/fs/nfs/filelayout/filelayout.c +++ b/fs/nfs/filelayout/filelayout.c @@ -32,6 +32,7 @@ #include <linux/nfs_fs.h> #include <linux/nfs_page.h> #include <linux/module.h> +#include <linux/backing-dev.h> #include <linux/sunrpc/metrics.h> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h index 9e6475bc5ba2..7e3c4604bea8 100644 --- a/fs/nfs/internal.h +++ b/fs/nfs/internal.h @@ -607,7 +607,7 @@ void nfs_mark_page_unstable(struct page *page) struct inode *inode = page_file_mapping(page)->host; inc_zone_page_state(page, NR_UNSTABLE_NFS); - inc_bdi_stat(inode_to_bdi(inode), BDI_RECLAIMABLE); + inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE); __mark_inode_dirty(inode, I_DIRTY_DATASYNC); } diff --git a/fs/nfs/write.c b/fs/nfs/write.c index d9851a6a2813..5051926f6356 100644 --- a/fs/nfs/write.c +++ b/fs/nfs/write.c @@ -853,7 +853,8 @@ static void nfs_clear_page_commit(struct page *page) { dec_zone_page_state(page, NR_UNSTABLE_NFS); - dec_bdi_stat(inode_to_bdi(page_file_mapping(page)->host), BDI_RECLAIMABLE); + dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb, + WB_RECLAIMABLE); } /* Called holding inode (/cinfo) lock */ diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c index dc3a9efdaab8..42468e5ab3e7 100644 --- a/fs/nilfs2/segbuf.c +++ b/fs/nilfs2/segbuf.c @@ -343,11 +343,6 @@ static void nilfs_end_bio_write(struct bio *bio, int err) const int uptodate = test_bit(BIO_UPTODATE, &bio->bi_flags); struct nilfs_segment_buffer *segbuf = bio->bi_private; - if (err == -EOPNOTSUPP) { - set_bit(BIO_EOPNOTSUPP, &bio->bi_flags); - /* to be detected by nilfs_segbuf_submit_bio() */ - } - if (!uptodate) atomic_inc(&segbuf->sb_err); @@ -374,15 +369,8 @@ static int nilfs_segbuf_submit_bio(struct nilfs_segment_buffer *segbuf, bio->bi_end_io = nilfs_end_bio_write; bio->bi_private = segbuf; - bio_get(bio); submit_bio(mode, bio); segbuf->sb_nbio++; - if (bio_flagged(bio, BIO_EOPNOTSUPP)) { - bio_put(bio); - err = -EOPNOTSUPP; - goto failed; - } - bio_put(bio); wi->bio = NULL; wi->rest_blocks -= wi->end - wi->start; diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c index d8b670cbd909..8f1feca89fb0 100644 --- a/fs/ocfs2/file.c +++ b/fs/ocfs2/file.c @@ -37,6 +37,7 @@ #include <linux/falloc.h> #include <linux/quotaops.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <cluster/masklog.h> diff --git a/fs/reiserfs/super.c b/fs/reiserfs/super.c index cf6fa25f884b..9a895b415711 100644 --- a/fs/reiserfs/super.c +++ b/fs/reiserfs/super.c @@ -21,6 +21,7 @@ #include "xattr.h" #include <linux/init.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/buffer_head.h> #include <linux/exportfs.h> #include <linux/quotaops.h> diff --git a/fs/ufs/super.c b/fs/ufs/super.c index dc33f9416340..250579a80d90 100644 --- a/fs/ufs/super.c +++ b/fs/ufs/super.c @@ -80,6 +80,7 @@ #include <linux/stat.h> #include <linux/string.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/init.h> #include <linux/parser.h> #include <linux/buffer_head.h> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c index a56960dd1684..feb7e3f01e8e 100644 --- a/fs/xfs/xfs_aops.c +++ b/fs/xfs/xfs_aops.c @@ -1874,6 +1874,7 @@ xfs_vm_set_page_dirty( loff_t end_offset; loff_t offset; int newly_dirty; + struct mem_cgroup *memcg; if (unlikely(!mapping)) return !TestSetPageDirty(page); @@ -1893,6 +1894,11 @@ xfs_vm_set_page_dirty( offset += 1 << inode->i_blkbits; } while (bh != head); } + /* + * Use mem_group_begin_page_stat() to keep PageDirty synchronized with + * per-memcg dirty page counters. + */ + memcg = mem_cgroup_begin_page_stat(page); newly_dirty = !TestSetPageDirty(page); spin_unlock(&mapping->private_lock); @@ -1903,13 +1909,15 @@ xfs_vm_set_page_dirty( spin_lock_irqsave(&mapping->tree_lock, flags); if (page->mapping) { /* Race with truncate? */ WARN_ON_ONCE(!PageUptodate(page)); - account_page_dirtied(page, mapping); + account_page_dirtied(page, mapping, memcg); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } spin_unlock_irqrestore(&mapping->tree_lock, flags); - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } + mem_cgroup_end_page_stat(memcg); + if (newly_dirty) + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); return newly_dirty; } diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 3b7591224f4a..7c62fca53e2f 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -41,6 +41,7 @@ #include <linux/dcache.h> #include <linux/falloc.h> #include <linux/pagevec.h> +#include <linux/backing-dev.h> static const struct vm_operations_struct xfs_file_vm_ops; diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h new file mode 100644 index 000000000000..a23209b43842 --- /dev/null +++ b/include/linux/backing-dev-defs.h @@ -0,0 +1,256 @@ +#ifndef __LINUX_BACKING_DEV_DEFS_H +#define __LINUX_BACKING_DEV_DEFS_H + +#include <linux/list.h> +#include <linux/radix-tree.h> +#include <linux/rbtree.h> +#include <linux/spinlock.h> +#include <linux/percpu_counter.h> +#include <linux/percpu-refcount.h> +#include <linux/flex_proportions.h> +#include <linux/timer.h> +#include <linux/workqueue.h> + +struct page; +struct device; +struct dentry; + +/* + * Bits in bdi_writeback.state + */ +enum wb_state { + WB_registered, /* bdi_register() was done */ + WB_writeback_running, /* Writeback is in progress */ + WB_has_dirty_io, /* Dirty inodes on ->b_{dirty|io|more_io} */ +}; + +enum wb_congested_state { + WB_async_congested, /* The async (write) queue is getting full */ + WB_sync_congested, /* The sync queue is getting full */ +}; + +typedef int (congested_fn)(void *, int); + +enum wb_stat_item { + WB_RECLAIMABLE, + WB_WRITEBACK, + WB_DIRTIED, + WB_WRITTEN, + NR_WB_STAT_ITEMS +}; + +#define WB_STAT_BATCH (8*(1+ilog2(nr_cpu_ids))) + +/* + * For cgroup writeback, multiple wb's may map to the same blkcg. Those + * wb's can operate mostly independently but should share the congested + * state. To facilitate such sharing, the congested state is tracked using + * the following struct which is created on demand, indexed by blkcg ID on + * its bdi, and refcounted. + */ +struct bdi_writeback_congested { + unsigned long state; /* WB_[a]sync_congested flags */ + atomic_t refcnt; /* nr of attached wb's and blkg */ + +#ifdef CONFIG_CGROUP_WRITEBACK + struct backing_dev_info *bdi; /* the associated bdi */ + int blkcg_id; /* ID of the associated blkcg */ + struct rb_node rb_node; /* on bdi->cgwb_congestion_tree */ +#endif +}; + +/* + * Each wb (bdi_writeback) can perform writeback operations, is measured + * and throttled, independently. Without cgroup writeback, each bdi + * (bdi_writeback) is served by its embedded bdi->wb. + * + * On the default hierarchy, blkcg implicitly enables memcg. This allows + * using memcg's page ownership for attributing writeback IOs, and every + * memcg - blkcg combination can be served by its own wb by assigning a + * dedicated wb to each memcg, which enables isolation across different + * cgroups and propagation of IO back pressure down from the IO layer upto + * the tasks which are generating the dirty pages to be written back. + * + * A cgroup wb is indexed on its bdi by the ID of the associated memcg, + * refcounted with the number of inodes attached to it, and pins the memcg + * and the corresponding blkcg. As the corresponding blkcg for a memcg may + * change as blkcg is disabled and enabled higher up in the hierarchy, a wb + * is tested for blkcg after lookup and removed from index on mismatch so + * that a new wb for the combination can be created. + */ +struct bdi_writeback { + struct backing_dev_info *bdi; /* our parent bdi */ + + unsigned long state; /* Always use atomic bitops on this */ + unsigned long last_old_flush; /* last old data flush */ + + struct list_head b_dirty; /* dirty inodes */ + struct list_head b_io; /* parked for writeback */ + struct list_head b_more_io; /* parked for more writeback */ + struct list_head b_dirty_time; /* time stamps are dirty */ + spinlock_t list_lock; /* protects the b_* lists */ + + struct percpu_counter stat[NR_WB_STAT_ITEMS]; + + struct bdi_writeback_congested *congested; + + unsigned long bw_time_stamp; /* last time write bw is updated */ + unsigned long dirtied_stamp; + unsigned long written_stamp; /* pages written at bw_time_stamp */ + unsigned long write_bandwidth; /* the estimated write bandwidth */ + unsigned long avg_write_bandwidth; /* further smoothed write bw, > 0 */ + + /* + * The base dirty throttle rate, re-calculated on every 200ms. + * All the bdi tasks' dirty rate will be curbed under it. + * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit + * in small steps and is much more smooth/stable than the latter. + */ + unsigned long dirty_ratelimit; + unsigned long balanced_dirty_ratelimit; + + struct fprop_local_percpu completions; + int dirty_exceeded; + + spinlock_t work_lock; /* protects work_list & dwork scheduling */ + struct list_head work_list; + struct delayed_work dwork; /* work item used for writeback */ + +#ifdef CONFIG_CGROUP_WRITEBACK + struct percpu_ref refcnt; /* used only for !root wb's */ + struct fprop_local_percpu memcg_completions; + struct cgroup_subsys_state *memcg_css; /* the associated memcg */ + struct cgroup_subsys_state *blkcg_css; /* and blkcg */ + struct list_head memcg_node; /* anchored at memcg->cgwb_list */ + struct list_head blkcg_node; /* anchored at blkcg->cgwb_list */ + + union { + struct work_struct release_work; + struct rcu_head rcu; + }; +#endif +}; + +struct backing_dev_info { + struct list_head bdi_list; + unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */ + unsigned int capabilities; /* Device capabilities */ + congested_fn *congested_fn; /* Function pointer if device is md/dm */ + void *congested_data; /* Pointer to aux data for congested func */ + + char *name; + + unsigned int min_ratio; + unsigned int max_ratio, max_prop_frac; + + /* + * Sum of avg_write_bw of wbs with dirty inodes. > 0 if there are + * any dirty wbs, which is depended upon by bdi_has_dirty(). + */ + atomic_long_t tot_write_bandwidth; + + struct bdi_writeback wb; /* the root writeback info for this bdi */ +#ifdef CONFIG_CGROUP_WRITEBACK + struct radix_tree_root cgwb_tree; /* radix tree of active cgroup wbs */ + struct rb_root cgwb_congested_tree; /* their congested states */ + atomic_t usage_cnt; /* counts both cgwbs and cgwb_contested's */ +#else + struct bdi_writeback_congested *wb_congested; +#endif + wait_queue_head_t wb_waitq; + + struct device *dev; + + struct timer_list laptop_mode_wb_timer; + +#ifdef CONFIG_DEBUG_FS + struct dentry *debug_dir; + struct dentry *debug_stats; +#endif +}; + +enum { + BLK_RW_ASYNC = 0, + BLK_RW_SYNC = 1, +}; + +void clear_wb_congested(struct bdi_writeback_congested *congested, int sync); +void set_wb_congested(struct bdi_writeback_congested *congested, int sync); + +static inline void clear_bdi_congested(struct backing_dev_info *bdi, int sync) +{ + clear_wb_congested(bdi->wb.congested, sync); +} + +static inline void set_bdi_congested(struct backing_dev_info *bdi, int sync) +{ + set_wb_congested(bdi->wb.congested, sync); +} + +#ifdef CONFIG_CGROUP_WRITEBACK + +/** + * wb_tryget - try to increment a wb's refcount + * @wb: bdi_writeback to get + */ +static inline bool wb_tryget(struct bdi_writeback *wb) +{ + if (wb != &wb->bdi->wb) + return percpu_ref_tryget(&wb->refcnt); + return true; +} + +/** + * wb_get - increment a wb's refcount + * @wb: bdi_writeback to get + */ +static inline void wb_get(struct bdi_writeback *wb) +{ + if (wb != &wb->bdi->wb) + percpu_ref_get(&wb->refcnt); +} + +/** + * wb_put - decrement a wb's refcount + * @wb: bdi_writeback to put + */ +static inline void wb_put(struct bdi_writeback *wb) +{ + if (wb != &wb->bdi->wb) + percpu_ref_put(&wb->refcnt); +} + +/** + * wb_dying - is a wb dying? + * @wb: bdi_writeback of interest + * + * Returns whether @wb is unlinked and being drained. + */ +static inline bool wb_dying(struct bdi_writeback *wb) +{ + return percpu_ref_is_dying(&wb->refcnt); +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static inline bool wb_tryget(struct bdi_writeback *wb) +{ + return true; +} + +static inline void wb_get(struct bdi_writeback *wb) +{ +} + +static inline void wb_put(struct bdi_writeback *wb) +{ +} + +static inline bool wb_dying(struct bdi_writeback *wb) +{ + return false; +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + +#endif /* __LINUX_BACKING_DEV_DEFS_H */ diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h index d87d8eced064..0fe9df983ab7 100644 --- a/include/linux/backing-dev.h +++ b/include/linux/backing-dev.h @@ -8,106 +8,14 @@ #ifndef _LINUX_BACKING_DEV_H #define _LINUX_BACKING_DEV_H -#include <linux/percpu_counter.h> -#include <linux/log2.h> -#include <linux/flex_proportions.h> #include <linux/kernel.h> #include <linux/fs.h> #include <linux/sched.h> -#include <linux/timer.h> +#include <linux/blkdev.h> #include <linux/writeback.h> -#include <linux/atomic.h> -#include <linux/sysctl.h> -#include <linux/workqueue.h> - -struct page; -struct device; -struct dentry; - -/* - * Bits in backing_dev_info.state - */ -enum bdi_state { - BDI_async_congested, /* The async (write) queue is getting full */ - BDI_sync_congested, /* The sync queue is getting full */ - BDI_registered, /* bdi_register() was done */ - BDI_writeback_running, /* Writeback is in progress */ -}; - -typedef int (congested_fn)(void *, int); - -enum bdi_stat_item { - BDI_RECLAIMABLE, - BDI_WRITEBACK, - BDI_DIRTIED, - BDI_WRITTEN, - NR_BDI_STAT_ITEMS -}; - -#define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids))) - -struct bdi_writeback { - struct backing_dev_info *bdi; /* our parent bdi */ - - unsigned long last_old_flush; /* last old data flush */ - - struct delayed_work dwork; /* work item used for writeback */ - struct list_head b_dirty; /* dirty inodes */ - struct list_head b_io; /* parked for writeback */ - struct list_head b_more_io; /* parked for more writeback */ - struct list_head b_dirty_time; /* time stamps are dirty */ - spinlock_t list_lock; /* protects the b_* lists */ -}; - -struct backing_dev_info { - struct list_head bdi_list; - unsigned long ra_pages; /* max readahead in PAGE_CACHE_SIZE units */ - unsigned long state; /* Always use atomic bitops on this */ - unsigned int capabilities; /* Device capabilities */ - congested_fn *congested_fn; /* Function pointer if device is md/dm */ - void *congested_data; /* Pointer to aux data for congested func */ - - char *name; - - struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS]; - - unsigned long bw_time_stamp; /* last time write bw is updated */ - unsigned long dirtied_stamp; - unsigned long written_stamp; /* pages written at bw_time_stamp */ - unsigned long write_bandwidth; /* the estimated write bandwidth */ - unsigned long avg_write_bandwidth; /* further smoothed write bw */ - - /* - * The base dirty throttle rate, re-calculated on every 200ms. - * All the bdi tasks' dirty rate will be curbed under it. - * @dirty_ratelimit tracks the estimated @balanced_dirty_ratelimit - * in small steps and is much more smooth/stable than the latter. - */ - unsigned long dirty_ratelimit; - unsigned long balanced_dirty_ratelimit; - - struct fprop_local_percpu completions; - int dirty_exceeded; - - unsigned int min_ratio; - unsigned int max_ratio, max_prop_frac; - - struct bdi_writeback wb; /* default writeback info for this bdi */ - spinlock_t wb_lock; /* protects work_list & wb.dwork scheduling */ - - struct list_head work_list; - - struct device *dev; - - struct timer_list laptop_mode_wb_timer; - -#ifdef CONFIG_DEBUG_FS - struct dentry *debug_dir; - struct dentry *debug_stats; -#endif -}; - -struct backing_dev_info *inode_to_bdi(struct inode *inode); +#include <linux/blk-cgroup.h> +#include <linux/backing-dev-defs.h> +#include <linux/slab.h> int __must_check bdi_init(struct backing_dev_info *bdi); void bdi_destroy(struct backing_dev_info *bdi); @@ -117,97 +25,99 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent, const char *fmt, ...); int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev); int __must_check bdi_setup_and_register(struct backing_dev_info *, char *); -void bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages, - enum wb_reason reason); -void bdi_start_background_writeback(struct backing_dev_info *bdi); -void bdi_writeback_workfn(struct work_struct *work); -int bdi_has_dirty_io(struct backing_dev_info *bdi); -void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi); +void wb_start_writeback(struct bdi_writeback *wb, long nr_pages, + bool range_cyclic, enum wb_reason reason); +void wb_start_background_writeback(struct bdi_writeback *wb); +void wb_workfn(struct work_struct *work); +void wb_wakeup_delayed(struct bdi_writeback *wb); extern spinlock_t bdi_lock; extern struct list_head bdi_list; extern struct workqueue_struct *bdi_wq; -static inline int wb_has_dirty_io(struct bdi_writeback *wb) +static inline bool wb_has_dirty_io(struct bdi_writeback *wb) { - return !list_empty(&wb->b_dirty) || - !list_empty(&wb->b_io) || - !list_empty(&wb->b_more_io); + return test_bit(WB_has_dirty_io, &wb->state); +} + +static inline bool bdi_has_dirty_io(struct backing_dev_info *bdi) +{ + /* + * @bdi->tot_write_bandwidth is guaranteed to be > 0 if there are + * any dirty wbs. See wb_update_write_bandwidth(). + */ + return atomic_long_read(&bdi->tot_write_bandwidth); } -static inline void __add_bdi_stat(struct backing_dev_info *bdi, - enum bdi_stat_item item, s64 amount) +static inline void __add_wb_stat(struct bdi_writeback *wb, + enum wb_stat_item item, s64 amount) { - __percpu_counter_add(&bdi->bdi_stat[item], amount, BDI_STAT_BATCH); + __percpu_counter_add(&wb->stat[item], amount, WB_STAT_BATCH); } -static inline void __inc_bdi_stat(struct backing_dev_info *bdi, - enum bdi_stat_item item) +static inline void __inc_wb_stat(struct bdi_writeback *wb, + enum wb_stat_item item) { - __add_bdi_stat(bdi, item, 1); + __add_wb_stat(wb, item, 1); } -static inline void inc_bdi_stat(struct backing_dev_info *bdi, - enum bdi_stat_item item) +static inline void inc_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) { unsigned long flags; local_irq_save(flags); - __inc_bdi_stat(bdi, item); + __inc_wb_stat(wb, item); local_irq_restore(flags); } -static inline void __dec_bdi_stat(struct backing_dev_info *bdi, - enum bdi_stat_item item) +static inline void __dec_wb_stat(struct bdi_writeback *wb, + enum wb_stat_item item) { - __add_bdi_stat(bdi, item, -1); + __add_wb_stat(wb, item, -1); } -static inline void dec_bdi_stat(struct backing_dev_info *bdi, - enum bdi_stat_item item) +static inline void dec_wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) { unsigned long flags; local_irq_save(flags); - __dec_bdi_stat(bdi, item); + __dec_wb_stat(wb, item); local_irq_restore(flags); } -static inline s64 bdi_stat(struct backing_dev_info *bdi, - enum bdi_stat_item item) +static inline s64 wb_stat(struct bdi_writeback *wb, enum wb_stat_item item) { - return percpu_counter_read_positive(&bdi->bdi_stat[item]); + return percpu_counter_read_positive(&wb->stat[item]); } -static inline s64 __bdi_stat_sum(struct backing_dev_info *bdi, - enum bdi_stat_item item) +static inline s64 __wb_stat_sum(struct bdi_writeback *wb, + enum wb_stat_item item) { - return percpu_counter_sum_positive(&bdi->bdi_stat[item]); + return percpu_counter_sum_positive(&wb->stat[item]); } -static inline s64 bdi_stat_sum(struct backing_dev_info *bdi, - enum bdi_stat_item item) +static inline s64 wb_stat_sum(struct bdi_writeback *wb, enum wb_stat_item item) { s64 sum; unsigned long flags; local_irq_save(flags); - sum = __bdi_stat_sum(bdi, item); + sum = __wb_stat_sum(wb, item); local_irq_restore(flags); return sum; } -extern void bdi_writeout_inc(struct backing_dev_info *bdi); +extern void wb_writeout_inc(struct bdi_writeback *wb); /* * maximal error of a stat counter. */ -static inline unsigned long bdi_stat_error(struct backing_dev_info *bdi) +static inline unsigned long wb_stat_error(struct bdi_writeback *wb) { #ifdef CONFIG_SMP - return nr_cpu_ids * BDI_STAT_BATCH; + return nr_cpu_ids * WB_STAT_BATCH; #else return 1; #endif @@ -231,50 +141,57 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio); * BDI_CAP_NO_WRITEBACK: Don't write pages back * BDI_CAP_NO_ACCT_WB: Don't automatically account writeback pages * BDI_CAP_STRICTLIMIT: Keep number of dirty pages below bdi threshold. + * + * BDI_CAP_CGROUP_WRITEBACK: Supports cgroup-aware writeback. */ #define BDI_CAP_NO_ACCT_DIRTY 0x00000001 #define BDI_CAP_NO_WRITEBACK 0x00000002 #define BDI_CAP_NO_ACCT_WB 0x00000004 #define BDI_CAP_STABLE_WRITES 0x00000008 #define BDI_CAP_STRICTLIMIT 0x00000010 +#define BDI_CAP_CGROUP_WRITEBACK 0x00000020 #define BDI_CAP_NO_ACCT_AND_WRITEBACK \ (BDI_CAP_NO_WRITEBACK | BDI_CAP_NO_ACCT_DIRTY | BDI_CAP_NO_ACCT_WB) extern struct backing_dev_info noop_backing_dev_info; -int writeback_in_progress(struct backing_dev_info *bdi); - -static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits) +/** + * writeback_in_progress - determine whether there is writeback in progress + * @wb: bdi_writeback of interest + * + * Determine whether there is writeback waiting to be handled against a + * bdi_writeback. + */ +static inline bool writeback_in_progress(struct bdi_writeback *wb) { - if (bdi->congested_fn) - return bdi->congested_fn(bdi->congested_data, bdi_bits); - return (bdi->state & bdi_bits); + return test_bit(WB_writeback_running, &wb->state); } -static inline int bdi_read_congested(struct backing_dev_info *bdi) +static inline struct backing_dev_info *inode_to_bdi(struct inode *inode) { - return bdi_congested(bdi, 1 << BDI_sync_congested); -} + struct super_block *sb; -static inline int bdi_write_congested(struct backing_dev_info *bdi) -{ - return bdi_congested(bdi, 1 << BDI_async_congested); + if (!inode) + return &noop_backing_dev_info; + + sb = inode->i_sb; +#ifdef CONFIG_BLOCK + if (sb_is_blkdev_sb(sb)) + return blk_get_backing_dev_info(I_BDEV(inode)); +#endif + return sb->s_bdi; } -static inline int bdi_rw_congested(struct backing_dev_info *bdi) +static inline int wb_congested(struct bdi_writeback *wb, int cong_bits) { - return bdi_congested(bdi, (1 << BDI_sync_congested) | - (1 << BDI_async_congested)); -} + struct backing_dev_info *bdi = wb->bdi; -enum { - BLK_RW_ASYNC = 0, - BLK_RW_SYNC = 1, -}; + if (bdi->congested_fn) + return bdi->congested_fn(bdi->congested_data, cong_bits); + return wb->congested->state & cong_bits; +} -void clear_bdi_congested(struct backing_dev_info *bdi, int sync); -void set_bdi_congested(struct backing_dev_info *bdi, int sync); long congestion_wait(int sync, long timeout); long wait_iff_congested(struct zone *zone, int sync, long timeout); int pdflush_proc_obsolete(struct ctl_table *table, int write, @@ -318,4 +235,336 @@ static inline int bdi_sched_wait(void *word) return 0; } -#endif /* _LINUX_BACKING_DEV_H */ +#ifdef CONFIG_CGROUP_WRITEBACK + +struct bdi_writeback_congested * +wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp); +void wb_congested_put(struct bdi_writeback_congested *congested); +struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css, + gfp_t gfp); +void wb_memcg_offline(struct mem_cgroup *memcg); +void wb_blkcg_offline(struct blkcg *blkcg); +int inode_congested(struct inode *inode, int cong_bits); + +/** + * inode_cgwb_enabled - test whether cgroup writeback is enabled on an inode + * @inode: inode of interest + * + * cgroup writeback requires support from both the bdi and filesystem. + * Test whether @inode has both. + */ +static inline bool inode_cgwb_enabled(struct inode *inode) +{ + struct backing_dev_info *bdi = inode_to_bdi(inode); + + return bdi_cap_account_dirty(bdi) && + (bdi->capabilities & BDI_CAP_CGROUP_WRITEBACK) && + (inode->i_sb->s_iflags & SB_I_CGROUPWB); +} + +/** + * wb_find_current - find wb for %current on a bdi + * @bdi: bdi of interest + * + * Find the wb of @bdi which matches both the memcg and blkcg of %current. + * Must be called under rcu_read_lock() which protects the returend wb. + * NULL if not found. + */ +static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) +{ + struct cgroup_subsys_state *memcg_css; + struct bdi_writeback *wb; + + memcg_css = task_css(current, memory_cgrp_id); + if (!memcg_css->parent) + return &bdi->wb; + + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + + /* + * %current's blkcg equals the effective blkcg of its memcg. No + * need to use the relatively expensive cgroup_get_e_css(). + */ + if (likely(wb && wb->blkcg_css == task_css(current, blkio_cgrp_id))) + return wb; + return NULL; +} + +/** + * wb_get_create_current - get or create wb for %current on a bdi + * @bdi: bdi of interest + * @gfp: allocation mask + * + * Equivalent to wb_get_create() on %current's memcg. This function is + * called from a relatively hot path and optimizes the common cases using + * wb_find_current(). + */ +static inline struct bdi_writeback * +wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) +{ + struct bdi_writeback *wb; + + rcu_read_lock(); + wb = wb_find_current(bdi); + if (wb && unlikely(!wb_tryget(wb))) + wb = NULL; + rcu_read_unlock(); + + if (unlikely(!wb)) { + struct cgroup_subsys_state *memcg_css; + + memcg_css = task_get_css(current, memory_cgrp_id); + wb = wb_get_create(bdi, memcg_css, gfp); + css_put(memcg_css); + } + return wb; +} + +/** + * inode_to_wb_is_valid - test whether an inode has a wb associated + * @inode: inode of interest + * + * Returns %true if @inode has a wb associated. May be called without any + * locking. + */ +static inline bool inode_to_wb_is_valid(struct inode *inode) +{ + return inode->i_wb; +} + +/** + * inode_to_wb - determine the wb of an inode + * @inode: inode of interest + * + * Returns the wb @inode is currently associated with. The caller must be + * holding either @inode->i_lock, @inode->i_mapping->tree_lock, or the + * associated wb's list_lock. + */ +static inline struct bdi_writeback *inode_to_wb(struct inode *inode) +{ +#ifdef CONFIG_LOCKDEP + WARN_ON_ONCE(debug_locks && + (!lockdep_is_held(&inode->i_lock) && + !lockdep_is_held(&inode->i_mapping->tree_lock) && + !lockdep_is_held(&inode->i_wb->list_lock))); +#endif + return inode->i_wb; +} + +/** + * unlocked_inode_to_wb_begin - begin unlocked inode wb access transaction + * @inode: target inode + * @lockedp: temp bool output param, to be passed to the end function + * + * The caller wants to access the wb associated with @inode but isn't + * holding inode->i_lock, mapping->tree_lock or wb->list_lock. This + * function determines the wb associated with @inode and ensures that the + * association doesn't change until the transaction is finished with + * unlocked_inode_to_wb_end(). + * + * The caller must call unlocked_inode_to_wb_end() with *@lockdep + * afterwards and can't sleep during transaction. IRQ may or may not be + * disabled on return. + */ +static inline struct bdi_writeback * +unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp) +{ + rcu_read_lock(); + + /* + * Paired with store_release in inode_switch_wb_work_fn() and + * ensures that we see the new wb if we see cleared I_WB_SWITCH. + */ + *lockedp = smp_load_acquire(&inode->i_state) & I_WB_SWITCH; + + if (unlikely(*lockedp)) + spin_lock_irq(&inode->i_mapping->tree_lock); + + /* + * Protected by either !I_WB_SWITCH + rcu_read_lock() or tree_lock. + * inode_to_wb() will bark. Deref directly. + */ + return inode->i_wb; +} + +/** + * unlocked_inode_to_wb_end - end inode wb access transaction + * @inode: target inode + * @locked: *@lockedp from unlocked_inode_to_wb_begin() + */ +static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked) +{ + if (unlikely(locked)) + spin_unlock_irq(&inode->i_mapping->tree_lock); + + rcu_read_unlock(); +} + +struct wb_iter { + int start_blkcg_id; + struct radix_tree_iter tree_iter; + void **slot; +}; + +static inline struct bdi_writeback *__wb_iter_next(struct wb_iter *iter, + struct backing_dev_info *bdi) +{ + struct radix_tree_iter *titer = &iter->tree_iter; + + WARN_ON_ONCE(!rcu_read_lock_held()); + + if (iter->start_blkcg_id >= 0) { + iter->slot = radix_tree_iter_init(titer, iter->start_blkcg_id); + iter->start_blkcg_id = -1; + } else { + iter->slot = radix_tree_next_slot(iter->slot, titer, 0); + } + + if (!iter->slot) + iter->slot = radix_tree_next_chunk(&bdi->cgwb_tree, titer, 0); + if (iter->slot) + return *iter->slot; + return NULL; +} + +static inline struct bdi_writeback *__wb_iter_init(struct wb_iter *iter, + struct backing_dev_info *bdi, + int start_blkcg_id) +{ + iter->start_blkcg_id = start_blkcg_id; + + if (start_blkcg_id) + return __wb_iter_next(iter, bdi); + else + return &bdi->wb; +} + +/** + * bdi_for_each_wb - walk all wb's of a bdi in ascending blkcg ID order + * @wb_cur: cursor struct bdi_writeback pointer + * @bdi: bdi to walk wb's of + * @iter: pointer to struct wb_iter to be used as iteration buffer + * @start_blkcg_id: blkcg ID to start iteration from + * + * Iterate @wb_cur through the wb's (bdi_writeback's) of @bdi in ascending + * blkcg ID order starting from @start_blkcg_id. @iter is struct wb_iter + * to be used as temp storage during iteration. rcu_read_lock() must be + * held throughout iteration. + */ +#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \ + for ((wb_cur) = __wb_iter_init(iter, bdi, start_blkcg_id); \ + (wb_cur); (wb_cur) = __wb_iter_next(iter, bdi)) + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static inline bool inode_cgwb_enabled(struct inode *inode) +{ + return false; +} + +static inline struct bdi_writeback_congested * +wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp) +{ + atomic_inc(&bdi->wb_congested->refcnt); + return bdi->wb_congested; +} + +static inline void wb_congested_put(struct bdi_writeback_congested *congested) +{ + if (atomic_dec_and_test(&congested->refcnt)) + kfree(congested); +} + +static inline struct bdi_writeback *wb_find_current(struct backing_dev_info *bdi) +{ + return &bdi->wb; +} + +static inline struct bdi_writeback * +wb_get_create_current(struct backing_dev_info *bdi, gfp_t gfp) +{ + return &bdi->wb; +} + +static inline bool inode_to_wb_is_valid(struct inode *inode) +{ + return true; +} + +static inline struct bdi_writeback *inode_to_wb(struct inode *inode) +{ + return &inode_to_bdi(inode)->wb; +} + +static inline struct bdi_writeback * +unlocked_inode_to_wb_begin(struct inode *inode, bool *lockedp) +{ + return inode_to_wb(inode); +} + +static inline void unlocked_inode_to_wb_end(struct inode *inode, bool locked) +{ +} + +static inline void wb_memcg_offline(struct mem_cgroup *memcg) +{ +} + +static inline void wb_blkcg_offline(struct blkcg *blkcg) +{ +} + +struct wb_iter { + int next_id; +}; + +#define bdi_for_each_wb(wb_cur, bdi, iter, start_blkcg_id) \ + for ((iter)->next_id = (start_blkcg_id); \ + ({ (wb_cur) = !(iter)->next_id++ ? &(bdi)->wb : NULL; }); ) + +static inline int inode_congested(struct inode *inode, int cong_bits) +{ + return wb_congested(&inode_to_bdi(inode)->wb, cong_bits); +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + +static inline int inode_read_congested(struct inode *inode) +{ + return inode_congested(inode, 1 << WB_sync_congested); +} + +static inline int inode_write_congested(struct inode *inode) +{ + return inode_congested(inode, 1 << WB_async_congested); +} + +static inline int inode_rw_congested(struct inode *inode) +{ + return inode_congested(inode, (1 << WB_sync_congested) | + (1 << WB_async_congested)); +} + +static inline int bdi_congested(struct backing_dev_info *bdi, int cong_bits) +{ + return wb_congested(&bdi->wb, cong_bits); +} + +static inline int bdi_read_congested(struct backing_dev_info *bdi) +{ + return bdi_congested(bdi, 1 << WB_sync_congested); +} + +static inline int bdi_write_congested(struct backing_dev_info *bdi) +{ + return bdi_congested(bdi, 1 << WB_async_congested); +} + +static inline int bdi_rw_congested(struct backing_dev_info *bdi) +{ + return bdi_congested(bdi, (1 << WB_sync_congested) | + (1 << WB_async_congested)); +} + +#endif /* _LINUX_BACKING_DEV_H */ diff --git a/include/linux/bio.h b/include/linux/bio.h index da3a127c9958..cbc5d1d0e47c 100644 --- a/include/linux/bio.h +++ b/include/linux/bio.h @@ -469,9 +469,12 @@ extern void bvec_free(mempool_t *, struct bio_vec *, unsigned int); extern unsigned int bvec_nr_vecs(unsigned short idx); #ifdef CONFIG_BLK_CGROUP +int bio_associate_blkcg(struct bio *bio, struct cgroup_subsys_state *blkcg_css); int bio_associate_current(struct bio *bio); void bio_disassociate_task(struct bio *bio); #else /* CONFIG_BLK_CGROUP */ +static inline int bio_associate_blkcg(struct bio *bio, + struct cgroup_subsys_state *blkcg_css) { return 0; } static inline int bio_associate_current(struct bio *bio) { return -ENOENT; } static inline void bio_disassociate_task(struct bio *bio) { } #endif /* CONFIG_BLK_CGROUP */ diff --git a/block/blk-cgroup.h b/include/linux/blk-cgroup.h index c567865b5f1d..07a32b813ed8 100644 --- a/block/blk-cgroup.h +++ b/include/linux/blk-cgroup.h @@ -53,6 +53,10 @@ struct blkcg { /* TODO: per-policy storage in blkcg */ unsigned int cfq_weight; /* belongs to cfq */ unsigned int cfq_leaf_weight; + +#ifdef CONFIG_CGROUP_WRITEBACK + struct list_head cgwb_list; +#endif }; struct blkg_stat { @@ -95,6 +99,12 @@ struct blkcg_gq { struct hlist_node blkcg_node; struct blkcg *blkcg; + /* + * Each blkg gets congested separately and the congestion state is + * propagated to the matching bdi_writeback_congested. + */ + struct bdi_writeback_congested *wb_congested; + /* all non-root blkcg_gq's are guaranteed to have access to parent */ struct blkcg_gq *parent; @@ -134,6 +144,7 @@ struct blkcg_policy { }; extern struct blkcg blkcg_root; +extern struct cgroup_subsys_state * const blkcg_root_css; struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q); struct blkcg_gq *blkg_lookup_create(struct blkcg *blkcg, @@ -194,6 +205,12 @@ static inline struct blkcg *bio_blkcg(struct bio *bio) return task_blkcg(current); } +static inline struct cgroup_subsys_state * +task_get_blkcg_css(struct task_struct *task) +{ + return task_get_css(task, blkio_cgrp_id); +} + /** * blkcg_parent - get the parent of a blkcg * @blkcg: blkcg of interest @@ -558,8 +575,8 @@ static inline void blkg_rwstat_merge(struct blkg_rwstat *to, #else /* CONFIG_BLK_CGROUP */ -struct cgroup; -struct blkcg; +struct blkcg { +}; struct blkg_policy_data { }; @@ -570,6 +587,16 @@ struct blkcg_gq { struct blkcg_policy { }; +#define blkcg_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL)) + +static inline struct cgroup_subsys_state * +task_get_blkcg_css(struct task_struct *task) +{ + return NULL; +} + +#ifdef CONFIG_BLOCK + static inline struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, void *key) { return NULL; } static inline int blkcg_init_queue(struct request_queue *q) { return 0; } static inline void blkcg_drain_queue(struct request_queue *q) { } @@ -599,5 +626,6 @@ static inline struct request_list *blk_rq_rl(struct request *rq) { return &rq->q #define blk_queue_for_each_rl(rl, q) \ for ((rl) = &(q)->root_rl; (rl); (rl) = NULL) +#endif /* CONFIG_BLOCK */ #endif /* CONFIG_BLK_CGROUP */ #endif /* _BLK_CGROUP_H */ diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index b7299febc4b4..7fda78fc098a 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -118,7 +118,6 @@ struct bio { #define BIO_CLONED 4 /* doesn't own data */ #define BIO_BOUNCED 5 /* bio is a bounce bio */ #define BIO_USER_MAPPED 6 /* contains user pages */ -#define BIO_EOPNOTSUPP 7 /* not supported */ #define BIO_NULL_MAPPED 8 /* contains invalid user pages */ #define BIO_QUIET 9 /* Make BIO Quiet */ #define BIO_SNAP_STABLE 10 /* bio data must be snapshotted during write */ diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h index 5d93a6645e88..c76b5fef714a 100644 --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -12,7 +12,7 @@ #include <linux/timer.h> #include <linux/workqueue.h> #include <linux/pagemap.h> -#include <linux/backing-dev.h> +#include <linux/backing-dev-defs.h> #include <linux/wait.h> #include <linux/mempool.h> #include <linux/bio.h> @@ -821,25 +821,6 @@ extern int scsi_cmd_ioctl(struct request_queue *, struct gendisk *, fmode_t, extern int sg_scsi_ioctl(struct request_queue *, struct gendisk *, fmode_t, struct scsi_ioctl_command __user *); -/* - * A queue has just exitted congestion. Note this in the global counter of - * congested queues, and wake up anyone who was waiting for requests to be - * put back. - */ -static inline void blk_clear_queue_congested(struct request_queue *q, int sync) -{ - clear_bdi_congested(&q->backing_dev_info, sync); -} - -/* - * A queue has just entered congestion. Flag that in the queue's VM-visible - * state flags and increment the global gounter of congested queues. - */ -static inline void blk_set_queue_congested(struct request_queue *q, int sync) -{ - set_bdi_congested(&q->backing_dev_info, sync); -} - extern void blk_start_queue(struct request_queue *q); extern void blk_stop_queue(struct request_queue *q); extern void blk_sync_queue(struct request_queue *q); diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h index 96a2ecd5aa69..7c37a5fa3ea5 100644 --- a/include/linux/cgroup.h +++ b/include/linux/cgroup.h @@ -323,6 +323,31 @@ static inline struct cgroup_subsys_state *task_css(struct task_struct *task, } /** + * task_get_css - find and get the css for (task, subsys) + * @task: the target task + * @subsys_id: the target subsystem ID + * + * Find the css for the (@task, @subsys_id) combination, increment a + * reference on and return it. This function is guaranteed to return a + * valid css. + */ +static inline struct cgroup_subsys_state * +task_get_css(struct task_struct *task, int subsys_id) +{ + struct cgroup_subsys_state *css; + + rcu_read_lock(); + while (true) { + css = task_css(task, subsys_id); + if (likely(css_tryget_online(css))) + break; + cpu_relax(); + } + rcu_read_unlock(); + return css; +} + +/** * task_css_is_root - test whether a task belongs to the root css * @task: the target task * @subsys_id: the target subsystem ID diff --git a/include/linux/cpu_cooling.h b/include/linux/cpu_cooling.h index bd955270d5aa..c156f5082758 100644 --- a/include/linux/cpu_cooling.h +++ b/include/linux/cpu_cooling.h @@ -28,6 +28,9 @@ #include <linux/thermal.h> #include <linux/cpumask.h> +typedef int (*get_static_t)(cpumask_t *cpumask, int interval, + unsigned long voltage, u32 *power); + #ifdef CONFIG_CPU_THERMAL /** * cpufreq_cooling_register - function to create cpufreq cooling device. @@ -36,6 +39,10 @@ struct thermal_cooling_device * cpufreq_cooling_register(const struct cpumask *clip_cpus); +struct thermal_cooling_device * +cpufreq_power_cooling_register(const struct cpumask *clip_cpus, + u32 capacitance, get_static_t plat_static_func); + /** * of_cpufreq_cooling_register - create cpufreq cooling device based on DT. * @np: a valid struct device_node to the cooling device device tree node. @@ -45,6 +52,12 @@ cpufreq_cooling_register(const struct cpumask *clip_cpus); struct thermal_cooling_device * of_cpufreq_cooling_register(struct device_node *np, const struct cpumask *clip_cpus); + +struct thermal_cooling_device * +of_cpufreq_power_cooling_register(struct device_node *np, + const struct cpumask *clip_cpus, + u32 capacitance, + get_static_t plat_static_func); #else static inline struct thermal_cooling_device * of_cpufreq_cooling_register(struct device_node *np, @@ -52,6 +65,15 @@ of_cpufreq_cooling_register(struct device_node *np, { return ERR_PTR(-ENOSYS); } + +static inline struct thermal_cooling_device * +of_cpufreq_power_cooling_register(struct device_node *np, + const struct cpumask *clip_cpus, + u32 capacitance, + get_static_t plat_static_func) +{ + return NULL; +} #endif /** @@ -68,11 +90,28 @@ cpufreq_cooling_register(const struct cpumask *clip_cpus) return ERR_PTR(-ENOSYS); } static inline struct thermal_cooling_device * +cpufreq_power_cooling_register(const struct cpumask *clip_cpus, + u32 capacitance, get_static_t plat_static_func) +{ + return NULL; +} + +static inline struct thermal_cooling_device * of_cpufreq_cooling_register(struct device_node *np, const struct cpumask *clip_cpus) { return ERR_PTR(-ENOSYS); } + +static inline struct thermal_cooling_device * +of_cpufreq_power_cooling_register(struct device_node *np, + const struct cpumask *clip_cpus, + u32 capacitance, + get_static_t plat_static_func) +{ + return NULL; +} + static inline void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev) { diff --git a/include/linux/device.h b/include/linux/device.h index 6558af90c8fe..dda6d94ce986 100644 --- a/include/linux/device.h +++ b/include/linux/device.h @@ -683,6 +683,7 @@ struct device_dma_parameters { * along with subsystem-level and driver-level callbacks. * @pins: For device pin management. * See Documentation/pinctrl.txt for details. + * @msi_domain: The generic MSI domain this device is using. * @numa_node: NUMA node this device is close to. * @dma_mask: Dma mask (if dma'ble device). * @coherent_dma_mask: Like dma_mask, but for alloc_coherent mapping as not all @@ -743,6 +744,9 @@ struct device { struct dev_pm_info power; struct dev_pm_domain *pm_domain; +#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN + struct irq_domain *msi_domain; +#endif #ifdef CONFIG_PINCTRL struct dev_pin_info *pins; #endif @@ -830,6 +834,22 @@ static inline void set_dev_node(struct device *dev, int node) } #endif +static inline struct irq_domain *dev_get_msi_domain(const struct device *dev) +{ +#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN + return dev->msi_domain; +#else + return NULL; +#endif +} + +static inline void dev_set_msi_domain(struct device *dev, struct irq_domain *d) +{ +#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN + dev->msi_domain = d; +#endif +} + static inline void *dev_get_drvdata(const struct device *dev) { return dev->driver_data; diff --git a/include/linux/fs.h b/include/linux/fs.h index fdc369fa69e8..a3efb3c0d310 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -35,6 +35,7 @@ #include <uapi/linux/fs.h> struct backing_dev_info; +struct bdi_writeback; struct export_operations; struct hd_geometry; struct iovec; @@ -635,6 +636,14 @@ struct inode { struct hlist_node i_hash; struct list_head i_wb_list; /* backing dev IO list */ +#ifdef CONFIG_CGROUP_WRITEBACK + struct bdi_writeback *i_wb; /* the associated cgroup wb */ + + /* foreign inode detection, see wbc_detach_inode() */ + int i_wb_frn_winner; + u16 i_wb_frn_avg_time; + u16 i_wb_frn_history; +#endif struct list_head i_lru; /* inode LRU list */ struct list_head i_sb_list; union { @@ -1247,6 +1256,8 @@ struct mm_struct; #define UMOUNT_NOFOLLOW 0x00000008 /* Don't follow symlink on umount */ #define UMOUNT_UNUSED 0x80000000 /* Flag guaranteed to be unused */ +/* sb->s_iflags */ +#define SB_I_CGROUPWB 0x00000001 /* cgroup-aware writeback enabled */ /* Possible states of 'frozen' field */ enum { @@ -1285,6 +1296,7 @@ struct super_block { const struct quotactl_ops *s_qcop; const struct export_operations *s_export_op; unsigned long s_flags; + unsigned long s_iflags; /* internal SB_I_* flags */ unsigned long s_magic; struct dentry *s_root; struct rw_semaphore s_umount; @@ -1820,6 +1832,11 @@ struct super_operations { * * I_DIO_WAKEUP Never set. Only used as a key for wait_on_bit(). * + * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to + * synchronize competing switching instances and to tell + * wb stat updates to grab mapping->tree_lock. See + * inode_switch_wb_work_fn() for details. + * * Q: What is the difference between I_WILL_FREE and I_FREEING? */ #define I_DIRTY_SYNC (1 << 0) @@ -1839,6 +1856,7 @@ struct super_operations { #define I_DIRTY_TIME (1 << 11) #define __I_DIRTY_TIME_EXPIRED 12 #define I_DIRTY_TIME_EXPIRED (1 << __I_DIRTY_TIME_EXPIRED) +#define I_WB_SWITCH (1 << 13) #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES) #define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME) @@ -2248,7 +2266,13 @@ extern struct super_block *freeze_bdev(struct block_device *); extern void emergency_thaw_all(void); extern int thaw_bdev(struct block_device *bdev, struct super_block *sb); extern int fsync_bdev(struct block_device *); -extern int sb_is_blkdev_sb(struct super_block *sb); + +extern struct super_block *blockdev_superblock; + +static inline bool sb_is_blkdev_sb(struct super_block *sb) +{ + return sb == blockdev_superblock; +} #else static inline void bd_forget(struct inode *inode) {} static inline int sync_blockdev(struct block_device *bdev) { return 0; } diff --git a/include/linux/irqchip/arm-gic-v3.h b/include/linux/irqchip/arm-gic-v3.h index ffbc034c8810..bf982e021fbd 100644 --- a/include/linux/irqchip/arm-gic-v3.h +++ b/include/linux/irqchip/arm-gic-v3.h @@ -360,6 +360,7 @@ #ifndef __ASSEMBLY__ #include <linux/stringify.h> +#include <asm/msi.h> /* * We need a value to serve as a irq-type for LPIs. Choose one that will diff --git a/include/linux/irqdomain.h b/include/linux/irqdomain.h index 676d7306a360..3c5ca459ac98 100644 --- a/include/linux/irqdomain.h +++ b/include/linux/irqdomain.h @@ -45,6 +45,20 @@ struct irq_data; /* Number of irqs reserved for a legacy isa controller */ #define NUM_ISA_INTERRUPTS 16 +/* + * Should several domains have the same device node, but serve + * different purposes (for example one domain is for PCI/MSI, and the + * other for wired IRQs), they can be distinguished using a + * bus-specific token. Most domains are expected to only carry + * DOMAIN_BUS_ANY. + */ +enum irq_domain_bus_token { + DOMAIN_BUS_ANY = 0, + DOMAIN_BUS_PCI_MSI, + DOMAIN_BUS_PLATFORM_MSI, + DOMAIN_BUS_NEXUS, +}; + /** * struct irq_domain_ops - Methods for irq_domain objects * @match: Match an interrupt controller device node to a host, returns @@ -61,7 +75,8 @@ struct irq_data; * to setup the irq_desc when returning from map(). */ struct irq_domain_ops { - int (*match)(struct irq_domain *d, struct device_node *node); + int (*match)(struct irq_domain *d, struct device_node *node, + enum irq_domain_bus_token bus_token); int (*map)(struct irq_domain *d, unsigned int virq, irq_hw_number_t hw); void (*unmap)(struct irq_domain *d, unsigned int virq); int (*xlate)(struct irq_domain *d, struct device_node *node, @@ -116,6 +131,7 @@ struct irq_domain { /* Optional data */ struct device_node *of_node; + enum irq_domain_bus_token bus_token; struct irq_domain_chip_generic *gc; #ifdef CONFIG_IRQ_DOMAIN_HIERARCHY struct irq_domain *parent; @@ -161,9 +177,15 @@ struct irq_domain *irq_domain_add_legacy(struct device_node *of_node, irq_hw_number_t first_hwirq, const struct irq_domain_ops *ops, void *host_data); -extern struct irq_domain *irq_find_host(struct device_node *node); +extern struct irq_domain *irq_find_matching_host(struct device_node *node, + enum irq_domain_bus_token bus_token); extern void irq_set_default_host(struct irq_domain *host); +static inline struct irq_domain *irq_find_host(struct device_node *node) +{ + return irq_find_matching_host(node, DOMAIN_BUS_ANY); +} + /** * irq_domain_add_linear() - Allocate and register a linear revmap irq_domain. * @of_node: pointer to interrupt controller's device tree node. diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 6c8918114804..73b02b0a8f60 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -41,6 +41,7 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ MEM_CGROUP_STAT_RSS_HUGE, /* # of pages charged as anon huge */ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ + MEM_CGROUP_STAT_DIRTY, /* # of dirty pages in page cache */ MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */ MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ MEM_CGROUP_STAT_NSTATS, @@ -67,6 +68,8 @@ enum mem_cgroup_events_index { }; #ifdef CONFIG_MEMCG +extern struct cgroup_subsys_state *mem_cgroup_root_css; + void mem_cgroup_events(struct mem_cgroup *memcg, enum mem_cgroup_events_index idx, unsigned int nr); @@ -112,6 +115,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm, } extern struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg); +extern struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page); struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *, struct mem_cgroup *, @@ -195,6 +199,8 @@ void mem_cgroup_split_huge_fixup(struct page *head); #else /* CONFIG_MEMCG */ struct mem_cgroup; +#define mem_cgroup_root_css ((struct cgroup_subsys_state *)ERR_PTR(-EINVAL)) + static inline void mem_cgroup_events(struct mem_cgroup *memcg, enum mem_cgroup_events_index idx, unsigned int nr) @@ -382,6 +388,29 @@ enum { OVER_LIMIT, }; +#ifdef CONFIG_CGROUP_WRITEBACK + +struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg); +struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb); +void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail, + unsigned long *pdirty, unsigned long *pwriteback); + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static inline struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) +{ + return NULL; +} + +static inline void mem_cgroup_wb_stats(struct bdi_writeback *wb, + unsigned long *pavail, + unsigned long *pdirty, + unsigned long *pwriteback) +{ +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + struct sock; #if defined(CONFIG_INET) && defined(CONFIG_MEMCG_KMEM) void sock_update_memcg(struct sock *sk); diff --git a/include/linux/mm.h b/include/linux/mm.h index b2085582d44e..ed271d320a00 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -27,6 +27,7 @@ struct anon_vma_chain; struct file_ra_state; struct user_struct; struct writeback_control; +struct bdi_writeback; #ifndef CONFIG_NEED_MULTIPLE_NODES /* Don't use mapnrs, do it properly */ extern unsigned long max_mapnr; @@ -1239,10 +1240,13 @@ int __set_page_dirty_nobuffers(struct page *page); int __set_page_dirty_no_writeback(struct page *page); int redirty_page_for_writepage(struct writeback_control *wbc, struct page *page); -void account_page_dirtied(struct page *page, struct address_space *mapping); -void account_page_cleaned(struct page *page, struct address_space *mapping); +void account_page_dirtied(struct page *page, struct address_space *mapping, + struct mem_cgroup *memcg); +void account_page_cleaned(struct page *page, struct address_space *mapping, + struct mem_cgroup *memcg, struct bdi_writeback *wb); int set_page_dirty(struct page *page); int set_page_dirty_lock(struct page *page); +void cancel_dirty_page(struct page *page); int clear_page_dirty_for_io(struct page *page); int get_cmdline(struct task_struct *task, char *buffer, int buflen); diff --git a/include/linux/msi.h b/include/linux/msi.h index 8ac4a68ffae2..692f217ae813 100644 --- a/include/linux/msi.h +++ b/include/linux/msi.h @@ -108,9 +108,6 @@ struct msi_controller { struct device *dev; struct device_node *of_node; struct list_head list; -#ifdef CONFIG_GENERIC_MSI_IRQ_DOMAIN - struct irq_domain *domain; -#endif int (*setup_irq)(struct msi_controller *chip, struct pci_dev *dev, struct msi_desc *desc); diff --git a/include/linux/of_irq.h b/include/linux/of_irq.h index d884929a7747..4bcbd586a672 100644 --- a/include/linux/of_irq.h +++ b/include/linux/of_irq.h @@ -74,6 +74,7 @@ static inline int of_irq_to_resource_table(struct device_node *dev, */ extern unsigned int irq_of_parse_and_map(struct device_node *node, int index); extern struct device_node *of_irq_find_parent(struct device_node *child); +extern void of_msi_configure(struct device *dev, struct device_node *np); #else /* !CONFIG_OF */ static inline unsigned int irq_of_parse_and_map(struct device_node *dev, diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index 4b3736f7065c..fb0814ca65c7 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -651,7 +651,8 @@ int add_to_page_cache_locked(struct page *page, struct address_space *mapping, int add_to_page_cache_lru(struct page *page, struct address_space *mapping, pgoff_t index, gfp_t gfp_mask); extern void delete_from_page_cache(struct page *page); -extern void __delete_from_page_cache(struct page *page, void *shadow); +extern void __delete_from_page_cache(struct page *page, void *shadow, + struct mem_cgroup *memcg); int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask); /* diff --git a/include/linux/pci.h b/include/linux/pci.h index 6e935e5eab56..661b48ae34b4 100644 --- a/include/linux/pci.h +++ b/include/linux/pci.h @@ -1857,10 +1857,12 @@ int pci_vpd_find_info_keyword(const u8 *buf, unsigned int off, /* PCI <-> OF binding helpers */ #ifdef CONFIG_OF struct device_node; +struct irq_domain; void pci_set_of_node(struct pci_dev *dev); void pci_release_of_node(struct pci_dev *dev); void pci_set_bus_of_node(struct pci_bus *bus); void pci_release_bus_of_node(struct pci_bus *bus); +struct irq_domain *pci_host_bridge_of_msi_domain(struct pci_bus *bus); /* Arch may override this (weak) */ struct device_node *pcibios_get_phb_of_node(struct pci_bus *bus); @@ -1883,6 +1885,8 @@ static inline void pci_set_bus_of_node(struct pci_bus *bus) { } static inline void pci_release_bus_of_node(struct pci_bus *bus) { } static inline struct device_node * pci_device_to_OF_node(const struct pci_dev *pdev) { return NULL; } +static inline struct irq_domain * +pci_host_bridge_of_msi_domain(struct pci_bus *bus) { return NULL; } #endif /* CONFIG_OF */ #ifdef CONFIG_EEH diff --git a/include/linux/psci.h b/include/linux/psci.h new file mode 100644 index 000000000000..12c4865457ad --- /dev/null +++ b/include/linux/psci.h @@ -0,0 +1,54 @@ +/* + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License version 2 as + * published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * Copyright (C) 2015 ARM Limited + */ + +#ifndef __LINUX_PSCI_H +#define __LINUX_PSCI_H + +#include <linux/init.h> +#include <linux/types.h> + +#define PSCI_POWER_STATE_TYPE_STANDBY 0 +#define PSCI_POWER_STATE_TYPE_POWER_DOWN 1 + +bool psci_tos_resident_on(int cpu); +bool psci_power_state_loses_context(u32 state); +bool psci_power_state_is_valid(u32 state); + +struct psci_operations { + int (*cpu_suspend)(u32 state, unsigned long entry_point); + int (*cpu_off)(u32 state); + int (*cpu_on)(unsigned long cpuid, unsigned long entry_point); + int (*migrate)(unsigned long cpuid); + int (*affinity_info)(unsigned long target_affinity, + unsigned long lowest_affinity_level); + int (*migrate_info_type)(void); +}; + +extern struct psci_operations psci_ops; + +#if defined(CONFIG_ARM_PSCI_FW) +int __init psci_dt_init(void); +#else +static inline int psci_dt_init(void) { return 0; } +#endif + +#if defined(CONFIG_ARM_PSCI_FW) && defined(CONFIG_ACPI) +int __init psci_acpi_init(void); +bool __init acpi_psci_present(void); +bool __init acpi_psci_use_hvc(void); +#else +static inline int psci_acpi_init(void) { return 0; } +static inline bool acpi_psci_present(void) { return false; } +#endif + +#endif /* __LINUX_PSCI_H */ diff --git a/include/linux/thermal.h b/include/linux/thermal.h index 2e7d0f7a0ecc..e7cec37ccf53 100644 --- a/include/linux/thermal.h +++ b/include/linux/thermal.h @@ -40,6 +40,9 @@ /* No upper/lower limit requirement */ #define THERMAL_NO_LIMIT ((u32)~0) +/* Default weight of a bound cooling device */ +#define THERMAL_WEIGHT_DEFAULT 0 + /* use value, which < 0K, to indicate an invalid/uninitialized temperature */ #define THERMAL_TEMP_INVALID -274000 @@ -59,10 +62,13 @@ #define DEFAULT_THERMAL_GOVERNOR "fair_share" #elif defined(CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE) #define DEFAULT_THERMAL_GOVERNOR "user_space" +#elif defined(CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR) +#define DEFAULT_THERMAL_GOVERNOR "power_allocator" #endif struct thermal_zone_device; struct thermal_cooling_device; +struct thermal_instance; enum thermal_device_mode { THERMAL_DEVICE_DISABLED = 0, @@ -116,6 +122,12 @@ struct thermal_cooling_device_ops { int (*get_max_state) (struct thermal_cooling_device *, unsigned long *); int (*get_cur_state) (struct thermal_cooling_device *, unsigned long *); int (*set_cur_state) (struct thermal_cooling_device *, unsigned long); + int (*get_requested_power)(struct thermal_cooling_device *, + struct thermal_zone_device *, u32 *); + int (*state2power)(struct thermal_cooling_device *, + struct thermal_zone_device *, unsigned long, u32 *); + int (*power2state)(struct thermal_cooling_device *, + struct thermal_zone_device *, u32, unsigned long *); }; struct thermal_cooling_device { @@ -147,8 +159,7 @@ struct thermal_attr { * @devdata: private pointer for device private data * @trips: number of trip points the thermal zone supports * @passive_delay: number of milliseconds to wait between polls when - * performing passive cooling. Currenty only used by the - * step-wise governor + * performing passive cooling. * @polling_delay: number of milliseconds to wait between polls when * checking whether trip points have been crossed (0 for * interrupt driven systems) @@ -158,7 +169,6 @@ struct thermal_attr { * @last_temperature: previous temperature read * @emul_temperature: emulated temperature when using CONFIG_THERMAL_EMULATION * @passive: 1 if you've crossed a passive trip point, 0 otherwise. - * Currenty only used by the step-wise governor. * @forced_passive: If > 0, temperature at which to switch on all ACPI * processor cooling devices. Currently only used by the * step-wise governor. @@ -166,6 +176,7 @@ struct thermal_attr { * @ops: operations this &thermal_zone_device supports * @tzp: thermal zone parameters * @governor: pointer to the governor for this thermal zone + * @governor_data: private pointer for governor data * @thermal_instances: list of &struct thermal_instance of this thermal zone * @idr: &struct idr to generate unique id for this zone's cooling * devices @@ -191,8 +202,9 @@ struct thermal_zone_device { unsigned int forced_passive; atomic_t need_update; struct thermal_zone_device_ops *ops; - const struct thermal_zone_params *tzp; + struct thermal_zone_params *tzp; struct thermal_governor *governor; + void *governor_data; struct list_head thermal_instances; struct idr idr; struct mutex lock; @@ -203,12 +215,19 @@ struct thermal_zone_device { /** * struct thermal_governor - structure that holds thermal governor information * @name: name of the governor + * @bind_to_tz: callback called when binding to a thermal zone. If it + * returns 0, the governor is bound to the thermal zone, + * otherwise it fails. + * @unbind_from_tz: callback called when a governor is unbound from a + * thermal zone. * @throttle: callback called for every trip point even if temperature is * below the trip point temperature * @governor_list: node in thermal_governor_list (in thermal_core.c) */ struct thermal_governor { char name[THERMAL_NAME_LENGTH]; + int (*bind_to_tz)(struct thermal_zone_device *tz); + void (*unbind_from_tz)(struct thermal_zone_device *tz); int (*throttle)(struct thermal_zone_device *tz, int trip); struct list_head governor_list; }; @@ -219,9 +238,12 @@ struct thermal_bind_params { /* * This is a measure of 'how effectively these devices can - * cool 'this' thermal zone. The shall be determined by platform - * characterization. This is on a 'percentage' scale. - * See Documentation/thermal/sysfs-api.txt for more information. + * cool 'this' thermal zone. It shall be determined by + * platform characterization. This value is relative to the + * rest of the weights so a cooling device whose weight is + * double that of another cooling device is twice as + * effective. See Documentation/thermal/sysfs-api.txt for more + * information. */ int weight; @@ -258,6 +280,44 @@ struct thermal_zone_params { int num_tbps; /* Number of tbp entries */ struct thermal_bind_params *tbp; + + /* + * Sustainable power (heat) that this thermal zone can dissipate in + * mW + */ + u32 sustainable_power; + + /* + * Proportional parameter of the PID controller when + * overshooting (i.e., when temperature is below the target) + */ + s32 k_po; + + /* + * Proportional parameter of the PID controller when + * undershooting + */ + s32 k_pu; + + /* Integral parameter of the PID controller */ + s32 k_i; + + /* Derivative parameter of the PID controller */ + s32 k_d; + + /* threshold below which the error is no longer accumulated */ + s32 integral_cutoff; + + /* + * @slope: slope of a linear temperature adjustment curve. + * Used by thermal zone drivers. + */ + int slope; + /* + * @offset: offset of a linear temperature adjustment curve. + * Used by thermal zone drivers (default 0). + */ + int offset; }; struct thermal_genl_event { @@ -321,14 +381,25 @@ void thermal_zone_of_sensor_unregister(struct device *dev, #endif #if IS_ENABLED(CONFIG_THERMAL) +static inline bool cdev_is_power_actor(struct thermal_cooling_device *cdev) +{ + return cdev->ops->get_requested_power && cdev->ops->state2power && + cdev->ops->power2state; +} + +int power_actor_get_max_power(struct thermal_cooling_device *, + struct thermal_zone_device *tz, u32 *max_power); +int power_actor_set_power(struct thermal_cooling_device *, + struct thermal_instance *, u32); struct thermal_zone_device *thermal_zone_device_register(const char *, int, int, void *, struct thermal_zone_device_ops *, - const struct thermal_zone_params *, int, int); + struct thermal_zone_params *, int, int); void thermal_zone_device_unregister(struct thermal_zone_device *); int thermal_zone_bind_cooling_device(struct thermal_zone_device *, int, struct thermal_cooling_device *, - unsigned long, unsigned long); + unsigned long, unsigned long, + unsigned int); int thermal_zone_unbind_cooling_device(struct thermal_zone_device *, int, struct thermal_cooling_device *); void thermal_zone_device_update(struct thermal_zone_device *); @@ -348,6 +419,14 @@ struct thermal_instance *get_thermal_instance(struct thermal_zone_device *, void thermal_cdev_update(struct thermal_cooling_device *); void thermal_notify_framework(struct thermal_zone_device *, int); #else +static inline bool cdev_is_power_actor(struct thermal_cooling_device *cdev) +{ return false; } +static inline int power_actor_get_max_power(struct thermal_cooling_device *cdev, + struct thermal_zone_device *tz, u32 *max_power) +{ return 0; } +static inline int power_actor_set_power(struct thermal_cooling_device *cdev, + struct thermal_instance *tz, u32 power) +{ return 0; } static inline struct thermal_zone_device *thermal_zone_device_register( const char *type, int trips, int mask, void *devdata, struct thermal_zone_device_ops *ops, diff --git a/include/linux/writeback.h b/include/linux/writeback.h index b2dd371ec0ca..b333c945e571 100644 --- a/include/linux/writeback.h +++ b/include/linux/writeback.h @@ -7,6 +7,8 @@ #include <linux/sched.h> #include <linux/workqueue.h> #include <linux/fs.h> +#include <linux/flex_proportions.h> +#include <linux/backing-dev-defs.h> DECLARE_PER_CPU(int, dirty_throttle_leaks); @@ -84,18 +86,95 @@ struct writeback_control { unsigned for_reclaim:1; /* Invoked from the page allocator */ unsigned range_cyclic:1; /* range_start is cyclic */ unsigned for_sync:1; /* sync(2) WB_SYNC_ALL writeback */ +#ifdef CONFIG_CGROUP_WRITEBACK + struct bdi_writeback *wb; /* wb this writeback is issued under */ + struct inode *inode; /* inode being written out */ + + /* foreign inode detection, see wbc_detach_inode() */ + int wb_id; /* current wb id */ + int wb_lcand_id; /* last foreign candidate wb id */ + int wb_tcand_id; /* this foreign candidate wb id */ + size_t wb_bytes; /* bytes written by current wb */ + size_t wb_lcand_bytes; /* bytes written by last candidate */ + size_t wb_tcand_bytes; /* bytes written by this candidate */ +#endif }; /* + * A wb_domain represents a domain that wb's (bdi_writeback's) belong to + * and are measured against each other in. There always is one global + * domain, global_wb_domain, that every wb in the system is a member of. + * This allows measuring the relative bandwidth of each wb to distribute + * dirtyable memory accordingly. + */ +struct wb_domain { + spinlock_t lock; + + /* + * Scale the writeback cache size proportional to the relative + * writeout speed. + * + * We do this by keeping a floating proportion between BDIs, based + * on page writeback completions [end_page_writeback()]. Those + * devices that write out pages fastest will get the larger share, + * while the slower will get a smaller share. + * + * We use page writeout completions because we are interested in + * getting rid of dirty pages. Having them written out is the + * primary goal. + * + * We introduce a concept of time, a period over which we measure + * these events, because demand can/will vary over time. The length + * of this period itself is measured in page writeback completions. + */ + struct fprop_global completions; + struct timer_list period_timer; /* timer for aging of completions */ + unsigned long period_time; + + /* + * The dirtyable memory and dirty threshold could be suddenly + * knocked down by a large amount (eg. on the startup of KVM in a + * swapless system). This may throw the system into deep dirty + * exceeded state and throttle heavy/light dirtiers alike. To + * retain good responsiveness, maintain global_dirty_limit for + * tracking slowly down to the knocked down dirty threshold. + * + * Both fields are protected by ->lock. + */ + unsigned long dirty_limit_tstamp; + unsigned long dirty_limit; +}; + +/** + * wb_domain_size_changed - memory available to a wb_domain has changed + * @dom: wb_domain of interest + * + * This function should be called when the amount of memory available to + * @dom has changed. It resets @dom's dirty limit parameters to prevent + * the past values which don't match the current configuration from skewing + * dirty throttling. Without this, when memory size of a wb_domain is + * greatly reduced, the dirty throttling logic may allow too many pages to + * be dirtied leading to consecutive unnecessary OOMs and may get stuck in + * that situation. + */ +static inline void wb_domain_size_changed(struct wb_domain *dom) +{ + spin_lock(&dom->lock); + dom->dirty_limit_tstamp = jiffies; + dom->dirty_limit = 0; + spin_unlock(&dom->lock); +} + +/* * fs/fs-writeback.c */ struct bdi_writeback; void writeback_inodes_sb(struct super_block *, enum wb_reason reason); void writeback_inodes_sb_nr(struct super_block *, unsigned long nr, enum wb_reason reason); -int try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason); -int try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr, - enum wb_reason reason); +bool try_to_writeback_inodes_sb(struct super_block *, enum wb_reason reason); +bool try_to_writeback_inodes_sb_nr(struct super_block *, unsigned long nr, + enum wb_reason reason); void sync_inodes_sb(struct super_block *); void wakeup_flusher_threads(long nr_pages, enum wb_reason reason); void inode_wait_for_writeback(struct inode *inode); @@ -107,6 +186,123 @@ static inline void wait_on_inode(struct inode *inode) wait_on_bit(&inode->i_state, __I_NEW, TASK_UNINTERRUPTIBLE); } +#ifdef CONFIG_CGROUP_WRITEBACK + +#include <linux/cgroup.h> +#include <linux/bio.h> + +void __inode_attach_wb(struct inode *inode, struct page *page); +void wbc_attach_and_unlock_inode(struct writeback_control *wbc, + struct inode *inode) + __releases(&inode->i_lock); +void wbc_detach_inode(struct writeback_control *wbc); +void wbc_account_io(struct writeback_control *wbc, struct page *page, + size_t bytes); + +/** + * inode_attach_wb - associate an inode with its wb + * @inode: inode of interest + * @page: page being dirtied (may be NULL) + * + * If @inode doesn't have its wb, associate it with the wb matching the + * memcg of @page or, if @page is NULL, %current. May be called w/ or w/o + * @inode->i_lock. + */ +static inline void inode_attach_wb(struct inode *inode, struct page *page) +{ + if (!inode->i_wb) + __inode_attach_wb(inode, page); +} + +/** + * inode_detach_wb - disassociate an inode from its wb + * @inode: inode of interest + * + * @inode is being freed. Detach from its wb. + */ +static inline void inode_detach_wb(struct inode *inode) +{ + if (inode->i_wb) { + wb_put(inode->i_wb); + inode->i_wb = NULL; + } +} + +/** + * wbc_attach_fdatawrite_inode - associate wbc and inode for fdatawrite + * @wbc: writeback_control of interest + * @inode: target inode + * + * This function is to be used by __filemap_fdatawrite_range(), which is an + * alternative entry point into writeback code, and first ensures @inode is + * associated with a bdi_writeback and attaches it to @wbc. + */ +static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc, + struct inode *inode) +{ + spin_lock(&inode->i_lock); + inode_attach_wb(inode, NULL); + wbc_attach_and_unlock_inode(wbc, inode); +} + +/** + * wbc_init_bio - writeback specific initializtion of bio + * @wbc: writeback_control for the writeback in progress + * @bio: bio to be initialized + * + * @bio is a part of the writeback in progress controlled by @wbc. Perform + * writeback specific initialization. This is used to apply the cgroup + * writeback context. + */ +static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio) +{ + /* + * pageout() path doesn't attach @wbc to the inode being written + * out. This is intentional as we don't want the function to block + * behind a slow cgroup. Ultimately, we want pageout() to kick off + * regular writeback instead of writing things out itself. + */ + if (wbc->wb) + bio_associate_blkcg(bio, wbc->wb->blkcg_css); +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static inline void inode_attach_wb(struct inode *inode, struct page *page) +{ +} + +static inline void inode_detach_wb(struct inode *inode) +{ +} + +static inline void wbc_attach_and_unlock_inode(struct writeback_control *wbc, + struct inode *inode) + __releases(&inode->i_lock) +{ + spin_unlock(&inode->i_lock); +} + +static inline void wbc_attach_fdatawrite_inode(struct writeback_control *wbc, + struct inode *inode) +{ +} + +static inline void wbc_detach_inode(struct writeback_control *wbc) +{ +} + +static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio) +{ +} + +static inline void wbc_account_io(struct writeback_control *wbc, + struct page *page, size_t bytes) +{ +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + /* * mm/page-writeback.c */ @@ -120,8 +316,12 @@ static inline void laptop_sync_completion(void) { } #endif void throttle_vm_writeout(gfp_t gfp_mask); bool zone_dirty_ok(struct zone *zone); +int wb_domain_init(struct wb_domain *dom, gfp_t gfp); +#ifdef CONFIG_CGROUP_WRITEBACK +void wb_domain_exit(struct wb_domain *dom); +#endif -extern unsigned long global_dirty_limit; +extern struct wb_domain global_wb_domain; /* These are exported to sysctl. */ extern int dirty_background_ratio; @@ -155,19 +355,12 @@ int dirty_writeback_centisecs_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty); -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, - unsigned long dirty); - -void __bdi_update_bandwidth(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, - unsigned long start_time); +unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh); +void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time); void page_writeback_init(void); void balance_dirty_pages_ratelimited(struct address_space *mapping); +bool wb_over_bg_thresh(struct bdi_writeback *wb); typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc, void *data); diff --git a/include/trace/events/thermal.h b/include/trace/events/thermal.h index 0f4f95d63c03..8b1f80682b80 100644 --- a/include/trace/events/thermal.h +++ b/include/trace/events/thermal.h @@ -77,6 +77,64 @@ TRACE_EVENT(thermal_zone_trip, __entry->trip_type) ); +TRACE_EVENT(thermal_power_cpu_get_power, + TP_PROTO(const struct cpumask *cpus, unsigned long freq, u32 *load, + size_t load_len, u32 dynamic_power, u32 static_power), + + TP_ARGS(cpus, freq, load, load_len, dynamic_power, static_power), + + TP_STRUCT__entry( + __bitmask(cpumask, num_possible_cpus()) + __field(unsigned long, freq ) + __dynamic_array(u32, load, load_len) + __field(size_t, load_len ) + __field(u32, dynamic_power ) + __field(u32, static_power ) + ), + + TP_fast_assign( + __assign_bitmask(cpumask, cpumask_bits(cpus), + num_possible_cpus()); + __entry->freq = freq; + memcpy(__get_dynamic_array(load), load, + load_len * sizeof(*load)); + __entry->load_len = load_len; + __entry->dynamic_power = dynamic_power; + __entry->static_power = static_power; + ), + + TP_printk("cpus=%s freq=%lu load={%s} dynamic_power=%d static_power=%d", + __get_bitmask(cpumask), __entry->freq, + __print_array(__get_dynamic_array(load), __entry->load_len, 4), + __entry->dynamic_power, __entry->static_power) +); + +TRACE_EVENT(thermal_power_cpu_limit, + TP_PROTO(const struct cpumask *cpus, unsigned int freq, + unsigned long cdev_state, u32 power), + + TP_ARGS(cpus, freq, cdev_state, power), + + TP_STRUCT__entry( + __bitmask(cpumask, num_possible_cpus()) + __field(unsigned int, freq ) + __field(unsigned long, cdev_state) + __field(u32, power ) + ), + + TP_fast_assign( + __assign_bitmask(cpumask, cpumask_bits(cpus), + num_possible_cpus()); + __entry->freq = freq; + __entry->cdev_state = cdev_state; + __entry->power = power; + ), + + TP_printk("cpus=%s freq=%u cdev_state=%lu power=%u", + __get_bitmask(cpumask), __entry->freq, __entry->cdev_state, + __entry->power) +); + #endif /* _TRACE_THERMAL_H */ /* This part must be outside protection */ diff --git a/include/trace/events/thermal_power_allocator.h b/include/trace/events/thermal_power_allocator.h new file mode 100644 index 000000000000..12e1321c4e0c --- /dev/null +++ b/include/trace/events/thermal_power_allocator.h @@ -0,0 +1,87 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM thermal_power_allocator + +#if !defined(_TRACE_THERMAL_POWER_ALLOCATOR_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_THERMAL_POWER_ALLOCATOR_H + +#include <linux/tracepoint.h> + +TRACE_EVENT(thermal_power_allocator, + TP_PROTO(struct thermal_zone_device *tz, u32 *req_power, + u32 total_req_power, u32 *granted_power, + u32 total_granted_power, size_t num_actors, + u32 power_range, u32 max_allocatable_power, + unsigned long current_temp, s32 delta_temp), + TP_ARGS(tz, req_power, total_req_power, granted_power, + total_granted_power, num_actors, power_range, + max_allocatable_power, current_temp, delta_temp), + TP_STRUCT__entry( + __field(int, tz_id ) + __dynamic_array(u32, req_power, num_actors ) + __field(u32, total_req_power ) + __dynamic_array(u32, granted_power, num_actors) + __field(u32, total_granted_power ) + __field(size_t, num_actors ) + __field(u32, power_range ) + __field(u32, max_allocatable_power ) + __field(unsigned long, current_temp ) + __field(s32, delta_temp ) + ), + TP_fast_assign( + __entry->tz_id = tz->id; + memcpy(__get_dynamic_array(req_power), req_power, + num_actors * sizeof(*req_power)); + __entry->total_req_power = total_req_power; + memcpy(__get_dynamic_array(granted_power), granted_power, + num_actors * sizeof(*granted_power)); + __entry->total_granted_power = total_granted_power; + __entry->num_actors = num_actors; + __entry->power_range = power_range; + __entry->max_allocatable_power = max_allocatable_power; + __entry->current_temp = current_temp; + __entry->delta_temp = delta_temp; + ), + + TP_printk("thermal_zone_id=%d req_power={%s} total_req_power=%u granted_power={%s} total_granted_power=%u power_range=%u max_allocatable_power=%u current_temperature=%lu delta_temperature=%d", + __entry->tz_id, + __print_array(__get_dynamic_array(req_power), + __entry->num_actors, 4), + __entry->total_req_power, + __print_array(__get_dynamic_array(granted_power), + __entry->num_actors, 4), + __entry->total_granted_power, __entry->power_range, + __entry->max_allocatable_power, __entry->current_temp, + __entry->delta_temp) +); + +TRACE_EVENT(thermal_power_allocator_pid, + TP_PROTO(struct thermal_zone_device *tz, s32 err, s32 err_integral, + s64 p, s64 i, s64 d, s32 output), + TP_ARGS(tz, err, err_integral, p, i, d, output), + TP_STRUCT__entry( + __field(int, tz_id ) + __field(s32, err ) + __field(s32, err_integral) + __field(s64, p ) + __field(s64, i ) + __field(s64, d ) + __field(s32, output ) + ), + TP_fast_assign( + __entry->tz_id = tz->id; + __entry->err = err; + __entry->err_integral = err_integral; + __entry->p = p; + __entry->i = i; + __entry->d = d; + __entry->output = output; + ), + + TP_printk("thermal_zone_id=%d err=%d err_integral=%d p=%lld i=%lld d=%lld output=%d", + __entry->tz_id, __entry->err, __entry->err_integral, + __entry->p, __entry->i, __entry->d, __entry->output) +); +#endif /* _TRACE_THERMAL_POWER_ALLOCATOR_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index c178d13d6f4c..a7aa607a4c55 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -360,7 +360,7 @@ TRACE_EVENT(global_dirty_state, __entry->nr_written = global_page_state(NR_WRITTEN); __entry->background_thresh = background_thresh; __entry->dirty_thresh = dirty_thresh; - __entry->dirty_limit = global_dirty_limit; + __entry->dirty_limit = global_wb_domain.dirty_limit; ), TP_printk("dirty=%lu writeback=%lu unstable=%lu " @@ -399,13 +399,13 @@ TRACE_EVENT(bdi_dirty_ratelimit, TP_fast_assign( strlcpy(__entry->bdi, dev_name(bdi->dev), 32); - __entry->write_bw = KBps(bdi->write_bandwidth); - __entry->avg_write_bw = KBps(bdi->avg_write_bandwidth); + __entry->write_bw = KBps(bdi->wb.write_bandwidth); + __entry->avg_write_bw = KBps(bdi->wb.avg_write_bandwidth); __entry->dirty_rate = KBps(dirty_rate); - __entry->dirty_ratelimit = KBps(bdi->dirty_ratelimit); + __entry->dirty_ratelimit = KBps(bdi->wb.dirty_ratelimit); __entry->task_ratelimit = KBps(task_ratelimit); __entry->balanced_dirty_ratelimit = - KBps(bdi->balanced_dirty_ratelimit); + KBps(bdi->wb.balanced_dirty_ratelimit); ), TP_printk("bdi %s: " @@ -462,8 +462,9 @@ TRACE_EVENT(balance_dirty_pages, unsigned long freerun = (thresh + bg_thresh) / 2; strlcpy(__entry->bdi, dev_name(bdi->dev), 32); - __entry->limit = global_dirty_limit; - __entry->setpoint = (global_dirty_limit + freerun) / 2; + __entry->limit = global_wb_domain.dirty_limit; + __entry->setpoint = (global_wb_domain.dirty_limit + + freerun) / 2; __entry->dirty = dirty; __entry->bdi_setpoint = __entry->setpoint * bdi_thresh / (thresh + 1); diff --git a/include/uapi/linux/psci.h b/include/uapi/linux/psci.h index 310d83e0a91b..3d7a0fc021a7 100644 --- a/include/uapi/linux/psci.h +++ b/include/uapi/linux/psci.h @@ -46,6 +46,11 @@ #define PSCI_0_2_FN64_MIGRATE PSCI_0_2_FN64(5) #define PSCI_0_2_FN64_MIGRATE_INFO_UP_CPU PSCI_0_2_FN64(7) +#define PSCI_1_0_FN_PSCI_FEATURES PSCI_0_2_FN(10) +#define PSCI_1_0_FN_SYSTEM_SUSPEND PSCI_0_2_FN(14) + +#define PSCI_1_0_FN64_SYSTEM_SUSPEND PSCI_0_2_FN64(14) + /* PSCI v0.2 power state encoding for CPU_SUSPEND function */ #define PSCI_0_2_POWER_STATE_ID_MASK 0xffff #define PSCI_0_2_POWER_STATE_ID_SHIFT 0 @@ -56,6 +61,13 @@ #define PSCI_0_2_POWER_STATE_AFFL_MASK \ (0x3 << PSCI_0_2_POWER_STATE_AFFL_SHIFT) +/* PSCI extended power state encoding for CPU_SUSPEND function */ +#define PSCI_1_0_EXT_POWER_STATE_ID_MASK 0xfffffff +#define PSCI_1_0_EXT_POWER_STATE_ID_SHIFT 0 +#define PSCI_1_0_EXT_POWER_STATE_TYPE_SHIFT 30 +#define PSCI_1_0_EXT_POWER_STATE_TYPE_MASK \ + (0x1 << PSCI_1_0_EXT_POWER_STATE_TYPE_SHIFT) + /* PSCI v0.2 affinity level state returned by AFFINITY_INFO */ #define PSCI_0_2_AFFINITY_LEVEL_ON 0 #define PSCI_0_2_AFFINITY_LEVEL_OFF 1 @@ -76,6 +88,11 @@ #define PSCI_VERSION_MINOR(ver) \ ((ver) & PSCI_VERSION_MINOR_MASK) +/* PSCI features decoding (>=1.0) */ +#define PSCI_1_0_FEATURES_CPU_SUSPEND_PF_SHIFT 1 +#define PSCI_1_0_FEATURES_CPU_SUSPEND_PF_MASK \ + (0x1 << PSCI_1_0_FEATURES_CPU_SUSPEND_PF_SHIFT) + /* PSCI return values (inclusive of all PSCI versions) */ #define PSCI_RET_SUCCESS 0 #define PSCI_RET_NOT_SUPPORTED -1 @@ -86,5 +103,6 @@ #define PSCI_RET_INTERNAL_FAILURE -6 #define PSCI_RET_NOT_PRESENT -7 #define PSCI_RET_DISABLED -8 +#define PSCI_RET_INVALID_ADDRESS -9 #endif /* _UAPI_LINUX_PSCI_H */ diff --git a/init/Kconfig b/init/Kconfig index dc24dec60232..d4f763332f9f 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1141,6 +1141,11 @@ config DEBUG_BLK_CGROUP Enable some debugging help. Currently it exports additional stat files in a cgroup which can be useful for debugging. +config CGROUP_WRITEBACK + bool + depends on MEMCG && BLK_CGROUP + default y + endif # CGROUPS config CHECKPOINT_RESTORE diff --git a/kernel/irq/irqdomain.c b/kernel/irq/irqdomain.c index 7fac311057b8..021f823794e7 100644 --- a/kernel/irq/irqdomain.c +++ b/kernel/irq/irqdomain.c @@ -187,10 +187,12 @@ struct irq_domain *irq_domain_add_legacy(struct device_node *of_node, EXPORT_SYMBOL_GPL(irq_domain_add_legacy); /** - * irq_find_host() - Locates a domain for a given device node + * irq_find_matching_host() - Locates a domain for a given device node * @node: device-tree node of the interrupt controller + * @bus_token: domain-specific data */ -struct irq_domain *irq_find_host(struct device_node *node) +struct irq_domain *irq_find_matching_host(struct device_node *node, + enum irq_domain_bus_token bus_token) { struct irq_domain *h, *found = NULL; int rc; @@ -199,13 +201,19 @@ struct irq_domain *irq_find_host(struct device_node *node) * it might potentially be set to match all interrupts in * the absence of a device node. This isn't a problem so far * yet though... + * + * bus_token == DOMAIN_BUS_ANY matches any domain, any other + * values must generate an exact match for the domain to be + * selected. */ mutex_lock(&irq_domain_mutex); list_for_each_entry(h, &irq_domain_list, link) { if (h->ops->match) - rc = h->ops->match(h, node); + rc = h->ops->match(h, node, bus_token); else - rc = (h->of_node != NULL) && (h->of_node == node); + rc = ((h->of_node != NULL) && (h->of_node == node) && + ((bus_token == DOMAIN_BUS_ANY) || + (h->bus_token == bus_token))); if (rc) { found = h; @@ -215,7 +223,7 @@ struct irq_domain *irq_find_host(struct device_node *node) mutex_unlock(&irq_domain_mutex); return found; } -EXPORT_SYMBOL_GPL(irq_find_host); +EXPORT_SYMBOL_GPL(irq_find_matching_host); /** * irq_set_default_host() - Set a "default" irq domain diff --git a/linaro/configs/android.conf b/linaro/configs/android.conf new file mode 100644 index 000000000000..f1aefaebcc93 --- /dev/null +++ b/linaro/configs/android.conf @@ -0,0 +1,47 @@ +CONFIG_IPV6=y +# CONFIG_IPV6_SIT is not set +CONFIG_PANIC_TIMEOUT=0 +CONFIG_HAS_WAKELOCK=y +CONFIG_WAKELOCK=y +CONFIG_BLK_DEV_LOOP=y +CONFIG_DM_CRYPT=y +CONFIG_POWER_SUPPLY=y +CONFIG_ANDROID_PARANOID_NETWORK=y +CONFIG_NET_ACTIVITY_STATS=y +CONFIG_INPUT_MISC=y +CONFIG_INPUT_UINPUT=y +CONFIG_INPUT_GPIO=y +CONFIG_USB_G_ANDROID=y +CONFIG_SWITCH=y +CONFIG_STAGING=y +CONFIG_ANDROID=y +CONFIG_ANDROID_BINDER_IPC=y +CONFIG_ASHMEM=y +CONFIG_ANDROID_LOGGER=y +CONFIG_ANDROID_TIMED_OUTPUT=y +CONFIG_ANDROID_TIMED_GPIO=y +CONFIG_ANDROID_LOW_MEMORY_KILLER=y +CONFIG_ANDROID_INTF_ALARM_DEV=y +CONFIG_CRYPTO_TWOFISH=y +CONFIG_BLK_DEV_RAM=y +CONFIG_BLK_DEV_RAM_COUNT=16 +CONFIG_BLK_DEV_RAM_SIZE=16384 +CONFIG_FUSE_FS=y +CONFIG_CPU_FREQ_GOV_INTERACTIVE=y +CONFIG_CPU_FREQ_DEFAULT_GOV_INTERACTIVE=y +CONFIG_SYNC=y +CONFIG_SW_SYNC=y +CONFIG_SW_SYNC_USER=y +CONFIG_ION=y +CONFIG_ION_TEST=y +CONFIG_ION_DUMMY=y +CONFIG_AUDIT=y +CONFIG_NF_CONNTRACK_SECMARK=y +CONFIG_NETFILTER_XT_TARGET_CONNSECMARK=y +CONFIG_NETFILTER_XT_TARGET_SECMARK=y +CONFIG_IP_NF_SECURITY=y +CONFIG_SECURITY=y +CONFIG_SECURITY_NETWORK=y +CONFIG_LSM_MMAP_MIN_ADDR=4096 +CONFIG_SECURITY_SELINUX=y +CONFIG_EXT4_FS_SECURITY=y diff --git a/linaro/configs/arndale.conf b/linaro/configs/arndale.conf new file mode 100644 index 000000000000..6d6f6de3024f --- /dev/null +++ b/linaro/configs/arndale.conf @@ -0,0 +1,98 @@ +CONFIG_KALLSYMS_ALL=y +CONFIG_PARTITION_ADVANCED=y +CONFIG_BSD_DISKLABEL=y +CONFIG_SOLARIS_X86_PARTITION=y +CONFIG_ARCH_EXYNOS=y +CONFIG_S3C_LOWLEVEL_UART_PORT=2 +CONFIG_ARCH_EXYNOS5=y +CONFIG_VMSPLIT_2G=y +CONFIG_NR_CPUS=2 +# CONFIG_THUMB2_KERNEL is not set +CONFIG_HIGHMEM=y +# CONFIG_COMPACTION is not set +CONFIG_ARM_APPENDED_DTB=y +CONFIG_ARM_ATAG_DTB_COMPAT=y +CONFIG_CMDLINE="root=/dev/ram0 rw ramdisk=8192 initrd=0x41000000,8M console=ttySAC1,115200 init= mem=256M" +CONFIG_CPU_FREQ_GOV_USERSPACE=y +CONFIG_VFP=y +CONFIG_NEON=y +CONFIG_PM_RUNTIME=y +CONFIG_CMA=y +CONFIG_DMA_CMA=y +CONFIG_CMA_SIZE_MBYTES=128 +CONFIG_BLK_DEV_LOOP=y +CONFIG_BLK_DEV_SD=y +CONFIG_CHR_DEV_SG=y +CONFIG_ATA=y +CONFIG_SATA_AHCI_PLATFORM=y +CONFIG_SATA_EXYNOS=y +CONFIG_AX88796=y +CONFIG_AX88796_93CX6=y +CONFIG_USB_USBNET=y +CONFIG_USB_NET_DM9601=y +CONFIG_USB_NET_MCS7830=y +CONFIG_INPUT_EVDEV=y +CONFIG_KEYBOARD_GPIO=y +CONFIG_INPUT_TOUCHSCREEN=y +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_SAMSUNG=y +CONFIG_SERIAL_SAMSUNG_CONSOLE=y +CONFIG_HW_RANDOM=y +CONFIG_I2C=y +CONFIG_I2C_S3C2410=y +CONFIG_THERMAL=y +CONFIG_CPU_THERMAL=y +CONFIG_EXYNOS_THERMAL=y +CONFIG_EXYNOS_THERMAL_CORE=y +CONFIG_MFD_SEC_CORE=y +CONFIG_REGULATOR=y +CONFIG_REGULATOR_FIXED_VOLTAGE=y +CONFIG_REGULATOR_S5M8767=y +CONFIG_DRM=y +CONFIG_DRM_LOAD_EDID_FIRMWARE=y +CONFIG_DRM_EXYNOS=y +CONFIG_DRM_EXYNOS_DMABUF=y +CONFIG_DRM_EXYNOS_HDMI=y +CONFIG_FRAMEBUFFER_CONSOLE=y +CONFIG_LOGO=y +CONFIG_SOUND=y +CONFIG_SND=y +CONFIG_SND_SOC=y +CONFIG_SND_SOC_SAMSUNG=y +CONFIG_SND_SOC_SMDK_I2S_STUB=y +CONFIG_USB=y +CONFIG_USB_ANNOUNCE_NEW_DEVICES=y +CONFIG_EXTCON=y +CONFIG_USB_DWC3=y +CONFIG_USB_DWC3_HOST=y +CONFIG_USB_XHCI_HCD=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_EHCI_ROOT_HUB_TT=y +CONFIG_USB_EHCI_EXYNOS=y +CONFIG_USB_OHCI_HCD=y +CONFIG_USB_OHCI_EXYNOS=y +CONFIG_USB_STORAGE=y +CONFIG_USB_HSIC_USB3503=y +CONFIG_USB_GADGET=y +CONFIG_MMC=y +CONFIG_MMC_UNSAFE_RESUME=y +CONFIG_MMC_DW=y +CONFIG_MMC_DW_IDMAC=y +CONFIG_MMC_DW_EXYNOS=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_S3C=y +CONFIG_PHY_EXYNOS5250_SATA=y +CONFIG_PHY_SAMSUNG_USB2=y +CONFIG_PHY_EXYNOS4210_USB2=y +CONFIG_PHY_EXYNOS4X12_USB2=y +CONFIG_PHY_EXYNOS5250_USB2=y +CONFIG_EXYNOS_SIMPLE_PHY=y +CONFIG_PHY_EXYNOS5_USBDRD=y +CONFIG_DEBUG_KERNEL=y +CONFIG_DETECT_HUNG_TASK=y +CONFIG_DEBUG_RT_MUTEXES=y +CONFIG_DEBUG_SPINLOCK=y +CONFIG_DEBUG_INFO=y +CONFIG_RCU_CPU_STALL_TIMEOUT=60 +CONFIG_DEBUG_USER=y +CONFIG_TUN=y diff --git a/linaro/configs/arndale_octa.conf b/linaro/configs/arndale_octa.conf new file mode 100644 index 000000000000..b8c42a665ff4 --- /dev/null +++ b/linaro/configs/arndale_octa.conf @@ -0,0 +1,86 @@ +CONFIG_KALLSYMS_ALL=y +CONFIG_PARTITION_ADVANCED=y +CONFIG_ARCH_EXYNOS=y +CONFIG_S3C_LOWLEVEL_UART_PORT=3 +CONFIG_ARCH_EXYNOS5=y +CONFIG_NR_CPUS=8 +# CONFIG_THUMB2_KERNEL is not set +CONFIG_HIGHMEM=y +CONFIG_CMA=y +CONFIG_ARM_APPENDED_DTB=y +CONFIG_ARM_ATAG_DTB_COMPAT=y +CONFIG_CMDLINE="root=/dev/ram0 rw ramdisk=8192 initrd=0x41000000,8M console=ttySAC3,115200 init=/linuxrc mem=256M" +CONFIG_VFP=y +CONFIG_NEON=y +CONFIG_PM_RUNTIME=y +CONFIG_DMA_CMA=y +CONFIG_CMA_SIZE_MBYTES=128 +CONFIG_SCSI=y +CONFIG_BLK_DEV_LOOP=y +CONFIG_AX88796=y +CONFIG_AX88796_93CX6=y +CONFIG_USB_USBNET=y +CONFIG_USB_NET_DM9601=y +CONFIG_USB_NET_MCS7830=y +CONFIG_INPUT_EVDEV=y +CONFIG_KEYBOARD_GPIO=y +CONFIG_INPUT_TOUCHSCREEN=y +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_SAMSUNG=y +CONFIG_SERIAL_SAMSUNG_CONSOLE=y +CONFIG_HW_RANDOM=y +CONFIG_I2C_EXYNOS5=y +CONFIG_I2C_S3C2410=y +CONFIG_THERMAL=y +CONFIG_CPU_THERMAL=y +CONFIG_EXYNOS_THERMAL=y +CONFIG_EXYNOS_THERMAL_CORE=y +CONFIG_MFD_SEC_CORE=y +CONFIG_REGULATOR=y +CONFIG_REGULATOR_FIXED_VOLTAGE=y +CONFIG_REGULATOR_S2MPS11=y +CONFIG_DRM=y +CONFIG_DRM_LOAD_EDID_FIRMWARE=y +CONFIG_DRM_EXYNOS=y +CONFIG_DRM_EXYNOS_DMABUF=y +CONFIG_DRM_EXYNOS_HDMI=y +CONFIG_FRAMEBUFFER_CONSOLE=y +CONFIG_LOGO=y +CONFIG_SOUND=y +CONFIG_SND=y +CONFIG_SND_SOC=y +CONFIG_SND_SOC_SAMSUNG=y +CONFIG_SND_SOC_SMDK_I2S_STUB=y +CONFIG_USB=y +CONFIG_USB_ANNOUNCE_NEW_DEVICES=y +CONFIG_USB_XHCI_HCD=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_EHCI_ROOT_HUB_TT=y +CONFIG_USB_EHCI_EXYNOS=y +CONFIG_USB_OHCI_HCD=y +CONFIG_USB_OHCI_EXYNOS=y +CONFIG_USB_STORAGE=y +CONFIG_USB_DWC3=y +CONFIG_USB_DWC3_HOST=y +CONFIG_USB_HSIC_USB3503=y +CONFIG_USB_GADGET=y +CONFIG_MMC=y +CONFIG_MMC_UNSAFE_RESUME=y +CONFIG_MMC_DW=y +CONFIG_MMC_DW_IDMAC=y +CONFIG_MMC_DW_EXYNOS=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_S3C=y +CONFIG_EXTCON=y +CONFIG_PHY_SAMSUNG_USB2=y +CONFIG_PHY_EXYNOS4210_USB2=y +CONFIG_PHY_EXYNOS4X12_USB2=y +CONFIG_PHY_EXYNOS5250_USB2=y +CONFIG_EXYNOS_SIMPLE_PHY=y +CONFIG_PHY_EXYNOS5_USBDRD=y +CONFIG_DEBUG_KERNEL=y +CONFIG_DETECT_HUNG_TASK=y +CONFIG_DEBUG_RT_MUTEXES=y +CONFIG_DEBUG_SPINLOCK=y +CONFIG_DEBUG_MUTEXES=y +CONFIG_DEBUG_USER=y diff --git a/linaro/configs/audit.conf b/linaro/configs/audit.conf new file mode 100644 index 000000000000..19e21170b5db --- /dev/null +++ b/linaro/configs/audit.conf @@ -0,0 +1,9 @@ +CONFIG_AUDIT=y +CONFIG_HAVE_ARCH_AUDITSYSCALL=y +CONFIG_AUDITSYSCALL=y +CONFIG_AUDIT_WATCH=y +CONFIG_AUDIT_TREE=y +CONFIG_INTEGRITY_AUDIT=y +CONFIG_IMA=y +CONFIG_EVM=y +CONFIG_AUDIT_GENERIC=y diff --git a/linaro/configs/beaglebone.conf b/linaro/configs/beaglebone.conf new file mode 100644 index 000000000000..bb7eefb491a4 --- /dev/null +++ b/linaro/configs/beaglebone.conf @@ -0,0 +1,22 @@ +CONFIG_SOC_AM33XX=y +# CONFIG_THUMB2_KERNEL is not set +# CONFIG_ARM_APPENDED_DTB is not set +CONFIG_MFD_TPS65217=y +CONFIG_REGULATOR_TPS65217=y +CONFIG_TI_CPSW=y +CONFIG_SMSC_PHY=y +CONFIG_USB_GADGET=y +CONFIG_USB_OTG=y +CONFIG_USB_MUSB_HDRC=y +CONFIG_USB_MUSB_DUAL_ROLE=y +CONFIG_USB_MUSB_DSPS=y +CONFIG_USB_MUSB_AM335X_CHILD=y +CONFIG_USB_TI_CPPI41_DMA=y +CONFIG_TI_CPPI41=y +CONFIG_AM335X_PHY_USB=y +# CONFIG_USB_OTG_WHITELIST is not set +CONFIG_USB_STORAGE=y +CONFIG_USB_GADGET_VBUS_DRAW=250 +CONFIG_USB_ETH=m +CONFIG_USB_ETH_RNDIS=y +CONFIG_USB_MASS_STORAGE=m diff --git a/linaro/configs/big-LITTLE-IKS.conf b/linaro/configs/big-LITTLE-IKS.conf new file mode 100644 index 000000000000..eccb00f851e0 --- /dev/null +++ b/linaro/configs/big-LITTLE-IKS.conf @@ -0,0 +1,4 @@ +CONFIG_BIG_LITTLE=y +CONFIG_BL_SWITCHER=y +CONFIG_ARM_DT_BL_CPUFREQ=y +CONFIG_CPU_FREQ_GOV_USERSPACE=y diff --git a/linaro/configs/bigendian.conf b/linaro/configs/bigendian.conf new file mode 100644 index 000000000000..6a1020299e85 --- /dev/null +++ b/linaro/configs/bigendian.conf @@ -0,0 +1,4 @@ +CONFIG_CPU_BIG_ENDIAN=y +CONFIG_CPU_ENDIAN_BE8=y +# CONFIG_VIRTUALIZATION is not set +# CONFIG_MMC_DW_IDMAC is not set diff --git a/linaro/configs/capri.conf b/linaro/configs/capri.conf new file mode 100644 index 000000000000..9eb4bd3a079b --- /dev/null +++ b/linaro/configs/capri.conf @@ -0,0 +1,90 @@ +# CONFIG_SWAP is not set +CONFIG_BSD_PROCESS_ACCT_V3=y +CONFIG_LOG_BUF_SHIFT=19 +CONFIG_CGROUP_FREEZER=y +CONFIG_CGROUP_DEVICE=y +CONFIG_CGROUP_CPUACCT=y +CONFIG_RESOURCE_COUNTERS=y +CONFIG_CGROUP_SCHED=y +CONFIG_BLK_CGROUP=y +CONFIG_NAMESPACES=y +CONFIG_SYSCTL_SYSCALL=y +# CONFIG_BLK_DEV_BSG is not set +CONFIG_PARTITION_ADVANCED=y +CONFIG_ARCH_BCM=y +CONFIG_ARCH_BCM_MOBILE=y +CONFIG_ARM_THUMBEE=y +CONFIG_PREEMPT=y +# CONFIG_COMPACTION is not set +CONFIG_ZBOOT_ROM_TEXT=0x0 +CONFIG_ZBOOT_ROM_BSS=0x0 +CONFIG_CMDLINE="console=ttyS0,115200n8 mem=128M" +CONFIG_VFP=y +CONFIG_NEON=y +# CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is not set +CONFIG_PM_RUNTIME=y +CONFIG_PACKET_DIAG=y +CONFIG_UNIX_DIAG=y +CONFIG_TCP_MD5SIG=y +# CONFIG_BLK_DEV is not set +CONFIG_SCSI=y +CONFIG_BLK_DEV_SD=y +CONFIG_CHR_DEV_SG=y +CONFIG_SCSI_MULTI_LUN=y +CONFIG_SCSI_SCAN_ASYNC=y +CONFIG_INPUT_FF_MEMLESS=y +CONFIG_INPUT_JOYDEV=y +CONFIG_INPUT_EVDEV=y +# CONFIG_KEYBOARD_ATKBD is not set +# CONFIG_INPUT_MOUSE is not set +CONFIG_INPUT_TOUCHSCREEN=y +# CONFIG_SERIO is not set +# CONFIG_LEGACY_PTYS is not set +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +CONFIG_SERIAL_8250_EXTENDED=y +CONFIG_SERIAL_8250_MANY_PORTS=y +CONFIG_SERIAL_8250_SHARE_IRQ=y +CONFIG_SERIAL_8250_RSA=y +CONFIG_SERIAL_8250_DW=y +CONFIG_HW_RANDOM=y +CONFIG_I2C=y +CONFIG_I2C_CHARDEV=y +# CONFIG_HWMON is not set +CONFIG_WATCHDOG=y +CONFIG_BCM_KONA_WDT=y +CONFIG_BCM_KONA_WDT_DEBUG=y +CONFIG_VIDEO_OUTPUT_CONTROL=y +CONFIG_FB=y +CONFIG_BACKLIGHT_LCD_SUPPORT=y +CONFIG_LCD_CLASS_DEVICE=y +CONFIG_BACKLIGHT_CLASS_DEVICE=y +CONFIG_BACKLIGHT_PWM=y +CONFIG_USB_GADGET=y +CONFIG_USB_S3C_HSOTG=y +CONFIG_USB_ETH=y +CONFIG_MMC=y +CONFIG_MMC_UNSAFE_RESUME=y +CONFIG_MMC_BLOCK_MINORS=32 +CONFIG_MMC_TEST=y +CONFIG_MMC_SDHCI=y +CONFIG_MMC_SDHCI_BCM_KONA=y +CONFIG_NEW_LEDS=y +CONFIG_LEDS_CLASS=y +CONFIG_LEDS_TRIGGERS=y +CONFIG_LEDS_TRIGGER_TIMER=y +CONFIG_LEDS_TRIGGER_HEARTBEAT=y +CONFIG_LEDS_TRIGGER_DEFAULT_ON=y +CONFIG_PWM=y +CONFIG_GENERIC_PHY=y +CONFIG_BCM_KONA_USB2_PHY=y +CONFIG_EXT4_FS_POSIX_ACL=y +CONFIG_EXT4_FS_SECURITY=y +CONFIG_FUSE_FS=y +CONFIG_DEBUG_INFO=y +CONFIG_DETECT_HUNG_TASK=y +CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=110 +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y +# CONFIG_FTRACE is not set +CONFIG_XZ_DEC=y +CONFIG_AVERAGE=y diff --git a/linaro/configs/debug.conf b/linaro/configs/debug.conf new file mode 100644 index 000000000000..36980566b2d8 --- /dev/null +++ b/linaro/configs/debug.conf @@ -0,0 +1 @@ +CONFIG_PROVE_LOCKING=y diff --git a/linaro/configs/distribution.conf b/linaro/configs/distribution.conf new file mode 100644 index 000000000000..6329bd98cab4 --- /dev/null +++ b/linaro/configs/distribution.conf @@ -0,0 +1,65 @@ +CONFIG_FHANDLE=y +CONFIG_CGROUPS=y +CONFIG_NAMESPACES=y +# CONFIG_COMPAT_BRK is not set +CONFIG_DEFAULT_MMAP_MIN_ADDR=32768 +CONFIG_SECCOMP=y +CONFIG_CC_STACKPROTECTOR_REGULAR=y +CONFIG_SYN_COOKIES=y +CONFIG_IPV6=m +CONFIG_NETLABEL=y +CONFIG_BRIDGE_NETFILTER=y +CONFIG_NF_CONNTRACK=m +CONFIG_NF_CT_NETLINK=m +CONFIG_NETFILTER_NETLINK=m +CONFIG_NETFILTER_NETLINK_LOG=m +CONFIG_NETFILTER_XT_CONNMARK=m +CONFIG_NETFILTER_XT_MARK=m +CONFIG_NETFILTER_XT_MATCH_CONNMARK=m +CONFIG_NETFILTER_XT_MATCH_CONNTRACK=m +CONFIG_NETFILTER_XT_MATCH_LIMIT=m +CONFIG_NETFILTER_XT_MATCH_MAC=m +CONFIG_NETFILTER_XT_MATCH_MULTIPORT=m +CONFIG_NETFILTER_XT_MATCH_STATE=m +CONFIG_NETFILTER_XT_TARGET_CHECKSUM=m +CONFIG_NETFILTER_XT_TARGET_LOG=m +CONFIG_NETFILTER_XT_TARGET_REDIRECT=m +CONFIG_NF_CONNTRACK_IPV4=m +CONFIG_NF_NAT_IPV4=m +CONFIG_IP_NF_TARGET_MASQUERADE=m +CONFIG_IP_NF_IPTABLES=m +CONFIG_IP_NF_FILTER=m +CONFIG_IP_NF_MANGLE=m +CONFIG_IP_NF_TARGET_REJECT=m +CONFIG_NF_CONNTRACK_IPV6=m +CONFIG_NF_NAT_IPV6=m +CONFIG_IP6_NF_TARGET_MASQUERADE=m +CONFIG_IP6_NF_IPTABLES=m +CONFIG_IP6_NF_FILTER=m +CONFIG_IP6_NF_MANGLE=m +CONFIG_IP6_NF_TARGET_REJECT=m +CONFIG_BRIDGE_NF_EBTABLES=m +CONFIG_BRIDGE_EBT_T_FILTER=m +CONFIG_BRIDGE_EBT_T_NAT=m +CONFIG_BRIDGE_EBT_MARK_T=m +CONFIG_BRIDGE=m +# CONFIG_UEVENT_HELPER is not set +CONFIG_DEVTMPFS=y +CONFIG_DEVTMPFS_MOUNT=y +CONFIG_BLK_DEV_LOOP=y +CONFIG_BLK_DEV_RAM=y +CONFIG_BLK_DEV_RAM_SIZE=65536 +CONFIG_INPUT_MISC=y +CONFIG_INPUT_UINPUT=y +# CONFIG_DEVKMEM is not set +CONFIG_FRAMEBUFFER_CONSOLE=y +CONFIG_AUTOFS4_FS=y +CONFIG_TMPFS_POSIX_ACL=y +CONFIG_STRICT_DEVMEM=y +CONFIG_SECURITY=y +CONFIG_LSM_MMAP_MIN_ADDR=0 +CONFIG_SECURITY_SELINUX=y +CONFIG_SECURITY_SMACK=y +CONFIG_SECURITY_APPARMOR=y +CONFIG_DEFAULT_SECURITY_APPARMOR=y +CONFIG_OF_SELFTEST=y diff --git a/linaro/configs/gcov.conf b/linaro/configs/gcov.conf new file mode 100644 index 000000000000..2d1209d1cf55 --- /dev/null +++ b/linaro/configs/gcov.conf @@ -0,0 +1,3 @@ +CONFIG_GCOV_KERNEL=y +CONFIG_GCOV_PROFILE_ALL=y +CONFIG_GCOV_FORMAT_AUTODETECT=y diff --git a/linaro/configs/hi3xxx.conf b/linaro/configs/hi3xxx.conf new file mode 100644 index 000000000000..0c46b02c842c --- /dev/null +++ b/linaro/configs/hi3xxx.conf @@ -0,0 +1,18 @@ +CONFIG_ARCH_HI3xxx=y +CONFIG_SMP=y +CONFIG_HOTPLUG_CPU=y +CONFIG_MFD_HI6421_PMIC=y +CONFIG_REGULATOR_HI6421=y +CONFIG_MMC_DW=y +CONFIG_MMC_DW_IDMAC=y +CONFIG_MMC_DW_PLTFM=y +CONFIG_MMC_DW_K3=y +CONFIG_MMC_BLOCK_MINORS=32 +CONFIG_K3_DMA=y +CONFIG_INPUT_TOUCHSCREEN=y +CONFIG_TOUCHSCREEN_MXT224E=y +CONFIG_INPUT_EVDEV=y +CONFIG_INPUT_MISC=y +CONFIG_INPUT_HI6421_ONKEY=y +CONFIG_RTC_DRV_HI6421=y +# CONFIG_THUMB2_KERNEL is not set diff --git a/linaro/configs/highbank.conf b/linaro/configs/highbank.conf new file mode 100644 index 000000000000..33b978128777 --- /dev/null +++ b/linaro/configs/highbank.conf @@ -0,0 +1,42 @@ +CONFIG_EXPERIMENTAL=y +CONFIG_NO_HZ=y +CONFIG_HIGHMEM=y +CONFIG_HIGHPTE=y +CONFIG_HIGH_RES_TIMERS=y +CONFIG_ARCH_HIGHBANK=y +CONFIG_ARM_ERRATA_754322=y +CONFIG_SMP=y +CONFIG_SCHED_MC=y +CONFIG_AEABI=y +CONFIG_CMDLINE="console=ttyAMA0" +CONFIG_CPU_IDLE=y +CONFIG_VFP=y +CONFIG_NEON=y +CONFIG_NET=y +CONFIG_SCSI=y +CONFIG_BLK_DEV_SD=y +CONFIG_ATA=y +CONFIG_SATA_AHCI_PLATFORM=y +CONFIG_SATA_HIGHBANK=y +CONFIG_NETDEVICES=y +CONFIG_NET_CALXEDA_XGMAC=y +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +CONFIG_IPMI_HANDLER=y +CONFIG_IPMI_SI=y +CONFIG_I2C=y +CONFIG_I2C_DESIGNWARE_PLATFORM=y +CONFIG_SPI=y +CONFIG_SPI_PL022=y +CONFIG_GPIO_PL061=y +CONFIG_MMC=y +CONFIG_MMC_SDHCI=y +CONFIG_MMC_SDHCI_PLTFM=y +CONFIG_EDAC=y +CONFIG_EDAC_MM_EDAC=y +CONFIG_EDAC_HIGHBANK_MC=y +CONFIG_EDAC_HIGHBANK_L2=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_PL031=y +CONFIG_DMADEVICES=y +CONFIG_PL330_DMA=y diff --git a/linaro/configs/hugepage.conf b/linaro/configs/hugepage.conf new file mode 100644 index 000000000000..85f56540acdd --- /dev/null +++ b/linaro/configs/hugepage.conf @@ -0,0 +1,3 @@ +CONFIG_HUGETLBFS=y +CONFIG_TRANSPARENT_HUGEPAGE=y +CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y diff --git a/linaro/configs/ifc6410.conf b/linaro/configs/ifc6410.conf new file mode 100644 index 000000000000..10ccca0928aa --- /dev/null +++ b/linaro/configs/ifc6410.conf @@ -0,0 +1,327 @@ +CONFIG_FHANDLE=y +CONFIG_IRQ_DOMAIN_DEBUG=y +CONFIG_PARTITION_ADVANCED=y +CONFIG_ARCH_VIRT=y +CONFIG_ARCH_BCM=y +CONFIG_ARCH_BCM_MOBILE=y +CONFIG_ARCH_BCM_5301X=y +CONFIG_ARCH_HIGHBANK=y +CONFIG_ARCH_HISI=y +CONFIG_ARCH_HI3xxx=y +CONFIG_ARCH_KEYSTONE=y +CONFIG_ARCH_MXC=y +CONFIG_MACH_IMX51_DT=y +CONFIG_SOC_IMX53=y +CONFIG_SOC_IMX6Q=y +CONFIG_SOC_IMX6SL=y +CONFIG_SOC_VF610=y +CONFIG_ARCH_OMAP3=y +CONFIG_ARCH_OMAP4=y +CONFIG_SOC_OMAP5=y +CONFIG_SOC_AM33XX=y +CONFIG_SOC_AM43XX=y +CONFIG_SOC_DRA7XX=y +CONFIG_ARCH_QCOM=y +CONFIG_ARCH_MSM8X60=y +CONFIG_ARCH_MSM8960=y +CONFIG_ARCH_MSM8974=y +CONFIG_ARCH_ROCKCHIP=y +CONFIG_ARCH_SOCFPGA=y +CONFIG_PLAT_SPEAR=y +CONFIG_ARCH_SPEAR13XX=y +CONFIG_MACH_SPEAR1310=y +CONFIG_MACH_SPEAR1340=y +CONFIG_ARCH_STI=y +CONFIG_ARCH_EXYNOS=y +CONFIG_ARCH_SUNXI=y +CONFIG_ARCH_SIRF=y +CONFIG_ARCH_TEGRA=y +CONFIG_ARCH_TEGRA_2x_SOC=y +CONFIG_ARCH_TEGRA_3x_SOC=y +CONFIG_ARCH_TEGRA_114_SOC=y +CONFIG_ARCH_TEGRA_124_SOC=y +CONFIG_ARCH_U8500=y +CONFIG_MACH_HREFV60=y +CONFIG_MACH_SNOWBALL=y +CONFIG_ARCH_VEXPRESS=y +CONFIG_ARCH_VEXPRESS_CA9X4=y +CONFIG_ARCH_WM8850=y +CONFIG_ARCH_ZYNQ=y +CONFIG_ARM_THUMBEE=y +CONFIG_PCI=y +CONFIG_PCI_MSI=y +CONFIG_PCI_TEGRA=y +CONFIG_NR_CPUS=8 +CONFIG_HIGHPTE=y +CONFIG_CMA=y +CONFIG_ARM_APPENDED_DTB=y +CONFIG_ARM_ATAG_DTB_COMPAT=y +CONFIG_KEXEC=y +CONFIG_CPU_FREQ_STAT_DETAILS=y +CONFIG_IPV6_ROUTER_PREF=y +CONFIG_IPV6_OPTIMISTIC_DAD=y +CONFIG_INET6_AH=m +CONFIG_INET6_ESP=m +CONFIG_INET6_IPCOMP=m +CONFIG_IPV6_MIP6=m +CONFIG_IPV6_TUNNEL=m +CONFIG_IPV6_MULTIPLE_TABLES=y +CONFIG_CAN=y +CONFIG_CAN_MCP251X=y +CONFIG_CFG80211=m +CONFIG_MAC80211=m +CONFIG_RFKILL=y +CONFIG_RFKILL_INPUT=y +CONFIG_RFKILL_GPIO=y +CONFIG_DMA_CMA=y +CONFIG_CMA_SIZE_MBYTES=64 +CONFIG_OMAP_OCP2SCP=y +CONFIG_ICS932S401=y +CONFIG_APDS9802ALS=y +CONFIG_ISL29003=y +CONFIG_EEPROM_AT24=y +CONFIG_EEPROM_SUNXI_SID=y +CONFIG_BLK_DEV_SD=y +CONFIG_BLK_DEV_SR=y +CONFIG_SCSI_MULTI_LUN=y +CONFIG_ATA=y +CONFIG_SATA_AHCI_PLATFORM=y +CONFIG_AHCI_QCOM=y +CONFIG_AHCI_SUNXI=y +CONFIG_SATA_HIGHBANK=y +CONFIG_SATA_MV=y +CONFIG_SUN4I_EMAC=y +CONFIG_ATL1C=y +CONFIG_MACB=y +CONFIG_NET_CALXEDA_XGMAC=y +CONFIG_MVMDIO=y +CONFIG_KS8851=y +CONFIG_R8169=y +CONFIG_SMSC911X=y +CONFIG_STMMAC_ETH=y +CONFIG_TI_CPSW=y +CONFIG_AT803X_PHY=y +CONFIG_MARVELL_PHY=y +CONFIG_ICPLUS_PHY=y +CONFIG_USB_PEGASUS=y +CONFIG_USB_USBNET=y +CONFIG_USB_NET_SMSC75XX=y +CONFIG_USB_NET_SMSC95XX=y +CONFIG_BRCMFMAC=m +CONFIG_RT2X00=m +CONFIG_RT2800USB=m +CONFIG_INPUT_EVDEV=y +CONFIG_KEYBOARD_GPIO=y +CONFIG_KEYBOARD_TEGRA=y +CONFIG_KEYBOARD_SPEAR=y +CONFIG_KEYBOARD_CROS_EC=y +CONFIG_MOUSE_PS2_ELANTECH=y +CONFIG_INPUT_MPU3050=y +CONFIG_SERIO_AMBAKMI=y +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +CONFIG_SERIAL_8250_DW=y +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +CONFIG_SERIAL_SAMSUNG=y +CONFIG_SERIAL_SAMSUNG_CONSOLE=y +CONFIG_SERIAL_SIRFSOC=y +CONFIG_SERIAL_SIRFSOC_CONSOLE=y +CONFIG_SERIAL_TEGRA=y +CONFIG_SERIAL_IMX=y +CONFIG_SERIAL_IMX_CONSOLE=y +CONFIG_SERIAL_MSM=y +CONFIG_SERIAL_MSM_CONSOLE=y +CONFIG_SERIAL_VT8500=y +CONFIG_SERIAL_VT8500_CONSOLE=y +CONFIG_SERIAL_OF_PLATFORM=y +CONFIG_SERIAL_OMAP=y +CONFIG_SERIAL_OMAP_CONSOLE=y +CONFIG_SERIAL_XILINX_PS_UART=y +CONFIG_SERIAL_XILINX_PS_UART_CONSOLE=y +CONFIG_SERIAL_FSL_LPUART=y +CONFIG_SERIAL_FSL_LPUART_CONSOLE=y +CONFIG_SERIAL_ST_ASC=y +CONFIG_SERIAL_ST_ASC_CONSOLE=y +CONFIG_I2C_CHARDEV=y +CONFIG_I2C_MUX_PCA954x=y +CONFIG_I2C_MUX_PINCTRL=y +CONFIG_I2C_CADENCE=y +CONFIG_I2C_DESIGNWARE_PLATFORM=y +CONFIG_I2C_EXYNOS5=y +CONFIG_I2C_MV64XXX=y +CONFIG_I2C_SIRF=y +CONFIG_I2C_TEGRA=y +CONFIG_SPI=y +CONFIG_SPI_OMAP24XX=y +CONFIG_SPI_PL022=y +CONFIG_SPI_SIRF=y +CONFIG_SPI_SUN4I=y +CONFIG_SPI_SUN6I=y +CONFIG_SPI_TEGRA114=y +CONFIG_SPI_TEGRA20_SFLASH=y +CONFIG_SPI_TEGRA20_SLINK=y +CONFIG_PINCTRL_AS3722=y +CONFIG_PINCTRL_APQ8064=y +CONFIG_PINCTRL_PALMAS=y +CONFIG_GPIO_DWAPB=y +CONFIG_GPIO_PCA953X=y +CONFIG_GPIO_PCA953X_IRQ=y +CONFIG_GPIO_TWL4030=y +CONFIG_GPIO_PALMAS=y +CONFIG_GPIO_TPS6586X=y +CONFIG_GPIO_TPS65910=y +CONFIG_BATTERY_SBS=y +CONFIG_CHARGER_TPS65090=y +CONFIG_POWER_RESET_AS3722=y +CONFIG_POWER_RESET_GPIO=y +CONFIG_POWER_RESET_SUN6I=y +CONFIG_SENSORS_LM90=y +CONFIG_THERMAL=y +CONFIG_WATCHDOG=y +CONFIG_SUNXI_WATCHDOG=y +CONFIG_MFD_AS3722=y +CONFIG_MFD_BCM590XX=y +CONFIG_MFD_CROS_EC=y +CONFIG_MFD_CROS_EC_SPI=y +CONFIG_MFD_MAX8907=y +CONFIG_MFD_SEC_CORE=y +CONFIG_MFD_PALMAS=y +CONFIG_MFD_TPS65090=y +CONFIG_MFD_TPS6586X=y +CONFIG_MFD_TPS65910=y +CONFIG_REGULATOR_VIRTUAL_CONSUMER=y +CONFIG_REGULATOR_AB8500=y +CONFIG_REGULATOR_AS3722=y +CONFIG_REGULATOR_BCM590XX=y +CONFIG_REGULATOR_GPIO=y +CONFIG_REGULATOR_MAX8907=y +CONFIG_REGULATOR_PALMAS=y +CONFIG_REGULATOR_QCOM_RPM=y +CONFIG_REGULATOR_S2MPS11=y +CONFIG_REGULATOR_S5M8767=y +CONFIG_REGULATOR_TPS51632=y +CONFIG_REGULATOR_TPS62360=y +CONFIG_REGULATOR_TPS65090=y +CONFIG_REGULATOR_TPS6586X=y +CONFIG_REGULATOR_TPS65910=y +CONFIG_REGULATOR_TWL4030=y +CONFIG_REGULATOR_VEXPRESS=y +CONFIG_MEDIA_SUPPORT=y +CONFIG_MEDIA_CAMERA_SUPPORT=y +CONFIG_MEDIA_USB_SUPPORT=y +CONFIG_USB_VIDEO_CLASS=y +CONFIG_USB_GSPCA=y +CONFIG_DRM=y +# CONFIG_DRM_MSM_DSI is not set +CONFIG_DRM_TEGRA=y +CONFIG_DRM_PANEL_SIMPLE=y +CONFIG_FB_ARMCLCD=y +CONFIG_FB_WM8505=y +CONFIG_FB_SIMPLE=y +CONFIG_BACKLIGHT_LCD_SUPPORT=y +CONFIG_BACKLIGHT_PWM=y +CONFIG_SOUND=y +CONFIG_SND=y +CONFIG_SND_SOC=y +CONFIG_SND_SOC_TEGRA=y +CONFIG_SND_SOC_TEGRA_RT5640=y +CONFIG_SND_SOC_TEGRA_WM8753=y +CONFIG_SND_SOC_TEGRA_WM8903=y +CONFIG_SND_SOC_TEGRA_TRIMSLICE=y +CONFIG_SND_SOC_TEGRA_ALC5632=y +CONFIG_SND_SOC_TEGRA_MAX98090=y +CONFIG_USB=y +CONFIG_USB_XHCI_HCD=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_EHCI_MSM=y +CONFIG_USB_EHCI_TEGRA=y +CONFIG_USB_EHCI_HCD_PLATFORM=y +CONFIG_USB_ISP1760_HCD=y +CONFIG_USB_OHCI_HCD=y +CONFIG_USB_OHCI_HCD_PLATFORM=y +CONFIG_USB_STORAGE=y +CONFIG_USB_CHIPIDEA=y +CONFIG_USB_CHIPIDEA_HOST=y +CONFIG_AB8500_USB=y +CONFIG_SAMSUNG_USB2PHY=y +CONFIG_SAMSUNG_USB3PHY=y +CONFIG_USB_GPIO_VBUS=y +CONFIG_USB_ISP1301=y +CONFIG_USB_MSM_OTG=y +CONFIG_USB_MXS_PHY=y +CONFIG_MMC=y +CONFIG_MMC_BLOCK_MINORS=16 +CONFIG_MMC_ARMMMCI=y +CONFIG_MMC_SDHCI=y +CONFIG_MMC_SDHCI_PLTFM=y +CONFIG_MMC_SDHCI_OF_ARASAN=y +CONFIG_MMC_SDHCI_ESDHC_IMX=y +CONFIG_MMC_SDHCI_TEGRA=y +CONFIG_MMC_SDHCI_S3C=y +CONFIG_MMC_SDHCI_PXAV3=y +CONFIG_MMC_SDHCI_SPEAR=y +CONFIG_MMC_SDHCI_S3C_DMA=y +CONFIG_MMC_SDHCI_BCM_KONA=y +CONFIG_MMC_OMAP=y +CONFIG_MMC_OMAP_HS=y +CONFIG_MMC_DW=y +CONFIG_MMC_DW_EXYNOS=y +CONFIG_MMC_SUNXI=y +CONFIG_EDAC=y +CONFIG_EDAC_MM_EDAC=y +CONFIG_EDAC_HIGHBANK_MC=y +CONFIG_EDAC_HIGHBANK_L2=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_AS3722=y +CONFIG_RTC_DRV_DS1307=y +CONFIG_RTC_DRV_MAX8907=y +CONFIG_RTC_DRV_PALMAS=y +CONFIG_RTC_DRV_TWL4030=y +CONFIG_RTC_DRV_TPS6586X=y +CONFIG_RTC_DRV_TPS65910=y +CONFIG_RTC_DRV_EM3027=y +CONFIG_RTC_DRV_PL031=y +CONFIG_RTC_DRV_VT8500=y +CONFIG_RTC_DRV_SUNXI=y +CONFIG_RTC_DRV_TEGRA=y +CONFIG_DMADEVICES=y +CONFIG_DW_DMAC=y +CONFIG_TEGRA20_APB_DMA=y +CONFIG_STE_DMA40=y +CONFIG_SIRF_DMA=y +CONFIG_TI_EDMA=y +CONFIG_PL330_DMA=y +CONFIG_IMX_SDMA=y +CONFIG_IMX_DMA=y +CONFIG_MXS_DMA=y +CONFIG_DMA_OMAP=y +CONFIG_QCOM_BAM_DMA=y +CONFIG_STAGING=y +CONFIG_SENSORS_ISL29018=y +CONFIG_SENSORS_ISL29028=y +CONFIG_MFD_NVEC=y +CONFIG_KEYBOARD_NVEC=y +CONFIG_SERIO_NVEC_PS2=y +CONFIG_NVEC_POWER=y +CONFIG_QCOM_GSBI=y +CONFIG_QCOM_RPM=y +CONFIG_COMMON_CLK_QCOM=y +CONFIG_MSM_GCC_8660=y +CONFIG_MSM_MMCC_8960=y +CONFIG_MSM_MMCC_8974=y +CONFIG_TEGRA_IOMMU_GART=y +CONFIG_TEGRA_IOMMU_SMMU=y +CONFIG_MEMORY=y +CONFIG_IIO=y +CONFIG_AK8975=y +CONFIG_PWM=y +CONFIG_PWM_TEGRA=y +CONFIG_PWM_VT8500=y +CONFIG_OMAP_USB2=y +CONFIG_PHY_SUN4I_USB=y +CONFIG_PHY_QCOM_APQ8064_SATA=y +CONFIG_SQUASHFS=y +CONFIG_SQUASHFS_LZO=y +CONFIG_SQUASHFS_XZ=y +CONFIG_LOCKUP_DETECTOR=y diff --git a/linaro/configs/imx5.conf b/linaro/configs/imx5.conf new file mode 100644 index 000000000000..0196c572bb14 --- /dev/null +++ b/linaro/configs/imx5.conf @@ -0,0 +1,105 @@ +CONFIG_ARCH_MXC=y +CONFIG_MXC_PWM=y +CONFIG_SOC_IMX6Q=y +CONFIG_HIGHMEM=y +CONFIG_COMPACTION=y +CONFIG_CMDLINE="noinitrd console=ttymxc0,115200 root=/dev/nfs nfsroot=192.168.0.101:/shared/nfs ip=dhcp" +CONFIG_CFG80211=y +CONFIG_MAC80211=y +CONFIG_MAC80211_RC_PID=y +CONFIG_MAC80211_RC_DEFAULT_PID=y +CONFIG_BLK_DEV_LOOP=y +CONFIG_SCSI=y +CONFIG_BLK_DEV_SD=y +CONFIG_SCSI_MULTI_LUN=y +CONFIG_SCSI_SCAN_ASYNC=y +CONFIG_SMC91X=y +CONFIG_MARVELL_PHY=y +CONFIG_DAVICOM_PHY=y +CONFIG_QSEMI_PHY=y +CONFIG_LXT_PHY=y +CONFIG_CICADA_PHY=y +CONFIG_VITESSE_PHY=y +CONFIG_SMSC_PHY=y +CONFIG_BROADCOM_PHY=y +CONFIG_ICPLUS_PHY=y +CONFIG_REALTEK_PHY=y +CONFIG_NATIONAL_PHY=y +CONFIG_STE10XP=y +CONFIG_LSI_ET1011C_PHY=y +CONFIG_MDIO_BITBANG=y +CONFIG_MDIO_GPIO=y +# CONFIG_WLAN is not set +CONFIG_INPUT_FF_MEMLESS=y +CONFIG_INPUT_POLLDEV=y +# CONFIG_INPUT_MOUSEDEV_PSAUX is not set +CONFIG_INPUT_EVDEV=y +CONFIG_INPUT_EVBUG=y +CONFIG_KEYBOARD_GPIO=y +CONFIG_KEYBOARD_MPR121=y +CONFIG_MOUSE_PS2=y +CONFIG_MOUSE_PS2_ELANTECH=y +CONFIG_INPUT_TOUCHSCREEN=y +CONFIG_SERIO_SERPORT=y +CONFIG_VT_HW_CONSOLE_BINDING=y +# CONFIG_LEGACY_PTYS is not set +CONFIG_SERIAL_IMX=y +CONFIG_SERIAL_IMX_CONSOLE=y +CONFIG_HW_RANDOM=y +CONFIG_I2C=y +# CONFIG_I2C_COMPAT is not set +CONFIG_I2C_CHARDEV=y +# CONFIG_I2C_HELPER_AUTO is not set +CONFIG_I2C_ALGOBIT=y +CONFIG_I2C_ALGOPCF=y +CONFIG_I2C_ALGOPCA=y +CONFIG_I2C_IMX=y +CONFIG_SPI=y +CONFIG_SPI_IMX=y +CONFIG_GPIO_SYSFS=y +CONFIG_POWER_SUPPLY=y +CONFIG_POWER_SUPPLY_DEBUG=y +CONFIG_TEST_POWER=y +CONFIG_MFD_MC13XXX=y +CONFIG_REGULATOR=y +CONFIG_REGULATOR_FIXED_VOLTAGE=y +CONFIG_REGULATOR_MC13892=y +CONFIG_MEDIA_SUPPORT=y +CONFIG_VIDEO_DEV=y +CONFIG_USB_VIDEO_CLASS=y +CONFIG_FB=y +CONFIG_FB_MODE_HELPERS=y +CONFIG_BACKLIGHT_LCD_SUPPORT=y +CONFIG_LCD_CLASS_DEVICE=y +CONFIG_BACKLIGHT_CLASS_DEVICE=y +# CONFIG_BACKLIGHT_GENERIC is not set +CONFIG_BACKLIGHT_PWM=y +CONFIG_FONTS=y +CONFIG_FONT_8x16=y +CONFIG_SOUND=y +CONFIG_SND=y +# CONFIG_SND_DRIVERS is not set +# CONFIG_SND_ARM is not set +# CONFIG_SND_SPI is not set +# CONFIG_SND_USB is not set +CONFIG_SND_SOC=y +CONFIG_SND_IMX_SOC=y +CONFIG_HIDRAW=y +CONFIG_HID_PID=y +CONFIG_USB_HIDDEV=y +CONFIG_USB=y +# CONFIG_USB_OTG_WHITELIST is not set +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_EHCI_MXC=y +CONFIG_USB_STORAGE=y +CONFIG_USB_GADGET=y +CONFIG_MMC=y +CONFIG_MMC_SDHCI=y +CONFIG_MMC_SDHCI_PLTFM=y +CONFIG_MMC_SDHCI_ESDHC_IMX=y +CONFIG_NEW_LEDS=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_INTF_DEV_UIE_EMUL=y +CONFIG_DMADEVICES=y +# CONFIG_DEBUG_BUGVERBOSE is not set +CONFIG_DEBUG_INFO=y diff --git a/linaro/configs/kvm-guest.conf b/linaro/configs/kvm-guest.conf new file mode 100644 index 000000000000..cf174f89d043 --- /dev/null +++ b/linaro/configs/kvm-guest.conf @@ -0,0 +1,17 @@ +CONFIG_9P_FS=y +CONFIG_NET_9P_VIRTIO=y +CONFIG_NET_9P=y +CONFIG_SERIAL_8250_CONSOLE=y +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_OF_PLATFORM=y +CONFIG_BALLOON_COMPACTION=y +CONFIG_VIRTIO_BLK=y +CONFIG_VIRTIO_NET=y +CONFIG_HVC_DRIVER=y +CONFIG_VIRTIO_CONSOLE=y +CONFIG_VIRTIO=y +CONFIG_VIRTIO_BALLOON=y +CONFIG_VIRTIO_MMIO=y +CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y +CONFIG_VIRTUALIZATION=y +# CONFIG_THUMB2_KERNEL is not set diff --git a/linaro/configs/kvm-host.conf b/linaro/configs/kvm-host.conf new file mode 100644 index 000000000000..d0fb67b35175 --- /dev/null +++ b/linaro/configs/kvm-host.conf @@ -0,0 +1,13 @@ +CONFIG_VIRTUALIZATION=y +CONFIG_ARM_LPAE=y +CONFIG_ARM_VIRT_EXT=y +CONFIG_HAVE_KVM_IRQCHIP=y +CONFIG_KVM_ARM_HOST=y +CONFIG_KVM_ARM_MAX_VCPUS=4 +CONFIG_KVM_ARM_TIMER=y +CONFIG_KVM_ARM_VGIC=y +CONFIG_KVM_MMIO=y +CONFIG_KVM=y +CONFIG_BLK_DEV_NBD=m +CONFIG_BRIDGE=m +CONFIG_TUN=y diff --git a/linaro/configs/linaro-base-arm.conf b/linaro/configs/linaro-base-arm.conf new file mode 100644 index 000000000000..70f771bf2c47 --- /dev/null +++ b/linaro/configs/linaro-base-arm.conf @@ -0,0 +1,3 @@ +CONFIG_THUMB2_KERNEL=y +CONFIG_AEABI=y +# CONFIG_OABI_COMPAT is not set diff --git a/linaro/configs/linaro-base-arm64.conf b/linaro/configs/linaro-base-arm64.conf new file mode 100644 index 000000000000..4ce12fd20c03 --- /dev/null +++ b/linaro/configs/linaro-base-arm64.conf @@ -0,0 +1,3 @@ +CONFIG_TRANSPARENT_HUGEPAGE=y +CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y +CONFIG_HUGETLBFS=y diff --git a/linaro/configs/linaro-base.conf b/linaro/configs/linaro-base.conf new file mode 100644 index 000000000000..1988c42cedea --- /dev/null +++ b/linaro/configs/linaro-base.conf @@ -0,0 +1,115 @@ +CONFIG_SYSVIPC=y +CONFIG_POSIX_MQUEUE=y +CONFIG_BSD_PROCESS_ACCT=y +CONFIG_IKCONFIG=y +CONFIG_IKCONFIG_PROC=y +CONFIG_LOG_BUF_SHIFT=16 +CONFIG_BLK_DEV_INITRD=y +CONFIG_EMBEDDED=y +CONFIG_PERF_EVENTS=y +CONFIG_SLAB=y +CONFIG_PROFILING=y +CONFIG_OPROFILE=y +CONFIG_MODULES=y +CONFIG_MODULE_UNLOAD=y +CONFIG_NO_HZ=y +CONFIG_HIGH_RES_TIMERS=y +CONFIG_SMP=y +CONFIG_SCHED_MC=y +CONFIG_SCHED_SMT=y +CONFIG_THUMB2_KERNEL=y +CONFIG_AEABI=y +# CONFIG_OABI_COMPAT is not set +CONFIG_CPU_FREQ=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y +CONFIG_CPU_IDLE=y +CONFIG_BINFMT_MISC=y +CONFIG_MD=y +CONFIG_BLK_DEV_DM=y +CONFIG_NET=y +CONFIG_PACKET=y +CONFIG_UNIX=y +CONFIG_XFRM_USER=y +CONFIG_NET_KEY=y +CONFIG_NET_KEY_MIGRATE=y +CONFIG_INET=y +CONFIG_IP_MULTICAST=y +CONFIG_IP_PNP=y +CONFIG_IP_PNP_DHCP=y +CONFIG_IP_PNP_BOOTP=y +CONFIG_IP_PNP_RARP=y +# CONFIG_INET_LRO is not set +CONFIG_NETFILTER=y +CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug" +CONFIG_CONNECTOR=y +CONFIG_MTD=y +CONFIG_MTD_CMDLINE_PARTS=y +CONFIG_MTD_BLOCK=y +CONFIG_MTD_OOPS=y +CONFIG_MTD_CFI=y +CONFIG_MTD_CFI_INTELEXT=y +CONFIG_MTD_NAND=y +CONFIG_NETDEVICES=y +CONFIG_GPIO_SYSFS=y +CONFIG_EXT2_FS=y +CONFIG_EXT3_FS=y +CONFIG_EXT4_FS=y +CONFIG_BTRFS_FS=y +CONFIG_QUOTA=y +CONFIG_QFMT_V2=y +CONFIG_MSDOS_FS=y +CONFIG_VFAT_FS=y +CONFIG_TMPFS=y +CONFIG_ECRYPT_FS=y +CONFIG_JFFS2_FS=y +CONFIG_JFFS2_SUMMARY=y +CONFIG_JFFS2_FS_XATTR=y +CONFIG_JFFS2_COMPRESSION_OPTIONS=y +CONFIG_JFFS2_LZO=y +CONFIG_JFFS2_RUBIN=y +CONFIG_CRAMFS=y +CONFIG_NETWORK_FILESYSTEMS=y +CONFIG_NFS_FS=y +CONFIG_NFS_V3=y +CONFIG_NFS_V3_ACL=y +CONFIG_NFS_V4=y +CONFIG_ROOT_NFS=y +CONFIG_NLS_CODEPAGE_437=y +CONFIG_NLS_ISO8859_1=y +CONFIG_PRINTK_TIME=y +CONFIG_MAGIC_SYSRQ=y +CONFIG_DEBUG_FS=y +CONFIG_SCHEDSTATS=y +CONFIG_TIMER_STATS=y +CONFIG_DEBUG_INFO=y +CONFIG_KEYS=y +CONFIG_CRYPTO_MICHAEL_MIC=y +CONFIG_CRC_CCITT=y +CONFIG_CRC_T10DIF=y +CONFIG_CRC_ITU_T=y +CONFIG_CRC7=y +CONFIG_HW_PERF_EVENTS=y +CONFIG_FUNCTION_TRACER=y +CONFIG_ENABLE_DEFAULT_TRACERS=y +CONFIG_PROC_DEVICETREE=y +CONFIG_AUDIT=y +CONFIG_SECURITY=y +CONFIG_SECURITY_NETWORK=y +CONFIG_SECURITY_SELINUX=y +CONFIG_EXT2_FS_SECURITY=y +CONFIG_EXT3_FS_SECURITY=y +CONFIG_EXT4_FS_SECURITY=y +CONFIG_JFS_SECURITY=y +CONFIG_XFS_SECURITY=y +CONFIG_JFFS2_FS_SECURITY=y +CONFIG_SECURITY_NETWORK_XFRM=y +CONFIG_NETLABEL=y +CONFIG_SECURITY_SELINUX_BOOTPARAM=y +CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1 +CONFIG_SECURITY_SELINUX_DISABLE=y +CONFIG_SECURITY_SELINUX_DEVELOP=y +CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1 +CONFIG_SECURITY_SELINUX_ENABLE_SECMARK_DEFAULT=y +CONFIG_SECURITY_SELINUX_AVC_STATS=y +CONFIG_SECURITY_CAPABILITIES=y +CONFIG_SECURITY_FILE_CAPABILITIES=y diff --git a/linaro/configs/linaro-base64.conf b/linaro/configs/linaro-base64.conf new file mode 120000 index 000000000000..a5af95de3e5a --- /dev/null +++ b/linaro/configs/linaro-base64.conf @@ -0,0 +1 @@ +linaro-base-arm64.conf
\ No newline at end of file diff --git a/linaro/configs/lt-arndale.conf b/linaro/configs/lt-arndale.conf new file mode 100644 index 000000000000..c54e0dcd0dcc --- /dev/null +++ b/linaro/configs/lt-arndale.conf @@ -0,0 +1 @@ +CONFIG_EXTRA_FIRMWARE="edid-1920x1080.fw" diff --git a/linaro/configs/lt-arndale_octa.conf b/linaro/configs/lt-arndale_octa.conf new file mode 100644 index 000000000000..c54e0dcd0dcc --- /dev/null +++ b/linaro/configs/lt-arndale_octa.conf @@ -0,0 +1 @@ +CONFIG_EXTRA_FIRMWARE="edid-1920x1080.fw" diff --git a/linaro/configs/multi_v7.conf b/linaro/configs/multi_v7.conf new file mode 100644 index 000000000000..682be962a196 --- /dev/null +++ b/linaro/configs/multi_v7.conf @@ -0,0 +1,151 @@ +CONFIG_IRQ_DOMAIN_DEBUG=y +CONFIG_ARCH_MVEBU=y +CONFIG_MACH_ARMADA_370=y +CONFIG_MACH_ARMADA_XP=y +CONFIG_ARCH_BCM=y +CONFIG_GPIO_PCA953X=y +CONFIG_ARCH_HIGHBANK=y +CONFIG_ARCH_KEYSTONE=y +CONFIG_ARCH_MXC=y +CONFIG_MACH_IMX51_DT=y +CONFIG_SOC_IMX53=y +CONFIG_SOC_IMX6Q=y +CONFIG_SOC_IMX6SL=y +CONFIG_SOC_VF610=y +CONFIG_ARCH_OMAP3=y +CONFIG_ARCH_OMAP4=y +CONFIG_SOC_OMAP5=y +CONFIG_SOC_AM33XX=y +CONFIG_SOC_AM43XX=y +CONFIG_ARCH_ROCKCHIP=y +CONFIG_ARCH_SOCFPGA=y +CONFIG_PLAT_SPEAR=y +CONFIG_ARCH_SPEAR13XX=y +CONFIG_MACH_SPEAR1310=y +CONFIG_MACH_SPEAR1340=y +CONFIG_ARCH_STI=y +CONFIG_ARCH_SUNXI=y +CONFIG_ARCH_SIRF=y +CONFIG_ARCH_TEGRA=y +CONFIG_ARCH_TEGRA_2x_SOC=y +CONFIG_ARCH_TEGRA_3x_SOC=y +CONFIG_ARCH_TEGRA_114_SOC=y +CONFIG_TEGRA_EMC_SCALING_ENABLE=y +CONFIG_ARCH_U8500=y +CONFIG_MACH_HREFV60=y +CONFIG_MACH_SNOWBALL=y +CONFIG_MACH_UX500_DT=y +CONFIG_ARCH_VEXPRESS=y +CONFIG_ARCH_VEXPRESS_CA9X4=y +CONFIG_ARCH_VIRT=y +CONFIG_ARCH_WM8850=y +CONFIG_ARCH_ZYNQ=y +CONFIG_HIGHPTE=y +CONFIG_ARM_APPENDED_DTB=y +CONFIG_ARM_ATAG_DTB_COMPAT=y +CONFIG_OMAP_OCP2SCP=y +CONFIG_BLK_DEV_SD=y +CONFIG_ATA=y +CONFIG_SATA_AHCI_PLATFORM=y +CONFIG_SATA_HIGHBANK=y +CONFIG_SATA_MV=y +CONFIG_SUN4I_EMAC=y +CONFIG_NET_CALXEDA_XGMAC=y +CONFIG_KS8851=y +CONFIG_SMSC911X=y +CONFIG_STMMAC_ETH=y +CONFIG_KEYBOARD_SPEAR=y +CONFIG_SERIO_AMBAKMI=y +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +CONFIG_SERIAL_8250_DW=y +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +CONFIG_SERIAL_SIRFSOC=y +CONFIG_SERIAL_SIRFSOC_CONSOLE=y +CONFIG_SERIAL_TEGRA=y +CONFIG_SERIAL_IMX=y +CONFIG_SERIAL_IMX_CONSOLE=y +CONFIG_SERIAL_VT8500=y +CONFIG_SERIAL_VT8500_CONSOLE=y +CONFIG_SERIAL_OF_PLATFORM=y +CONFIG_SERIAL_OMAP=y +CONFIG_SERIAL_OMAP_CONSOLE=y +CONFIG_SERIAL_XILINX_PS_UART=y +CONFIG_SERIAL_XILINX_PS_UART_CONSOLE=y +CONFIG_SERIAL_FSL_LPUART=y +CONFIG_SERIAL_FSL_LPUART_CONSOLE=y +CONFIG_I2C_DESIGNWARE_PLATFORM=y +CONFIG_I2C_SIRF=y +CONFIG_I2C_TEGRA=y +CONFIG_SPI=y +CONFIG_SPI_OMAP24XX=y +CONFIG_SPI_PL022=y +CONFIG_SPI_SIRF=y +CONFIG_SPI_TEGRA114=y +CONFIG_SPI_TEGRA20_SLINK=y +CONFIG_PINCTRL_SINGLE=y +CONFIG_GPIO_GENERIC_PLATFORM=y +CONFIG_GPIO_TWL4030=y +CONFIG_REGULATOR_AB8500=y +CONFIG_REGULATOR_GPIO=y +CONFIG_REGULATOR_TPS51632=y +CONFIG_REGULATOR_TPS62360=y +CONFIG_REGULATOR_TWL4030=y +CONFIG_REGULATOR_VEXPRESS=y +CONFIG_DRM=y +CONFIG_TEGRA_HOST1X=y +CONFIG_DRM_TEGRA=y +CONFIG_FB_ARMCLCD=y +CONFIG_FB_WM8505=y +CONFIG_FB_SIMPLE=y +CONFIG_USB=y +CONFIG_USB_XHCI_HCD=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_EHCI_TEGRA=y +CONFIG_USB_EHCI_HCD_PLATFORM=y +CONFIG_USB_ISP1760_HCD=y +CONFIG_USB_STORAGE=y +CONFIG_USB_CHIPIDEA=y +CONFIG_USB_CHIPIDEA_HOST=y +CONFIG_AB8500_USB=y +CONFIG_OMAP_USB2=y +CONFIG_OMAP_USB3=y +CONFIG_SAMSUNG_USB2PHY=y +CONFIG_SAMSUNG_USB3PHY=y +CONFIG_USB_GPIO_VBUS=y +CONFIG_USB_ISP1301=y +CONFIG_USB_MXS_PHY=y +CONFIG_MMC=y +CONFIG_MMC_ARMMMCI=y +CONFIG_MMC_SDHCI=y +CONFIG_MMC_SDHCI_PLTFM=y +CONFIG_MMC_SDHCI_ESDHC_IMX=y +CONFIG_MMC_SDHCI_TEGRA=y +CONFIG_MMC_SDHCI_SPEAR=y +CONFIG_MMC_OMAP=y +CONFIG_MMC_OMAP_HS=y +CONFIG_EDAC=y +CONFIG_EDAC_MM_EDAC=y +CONFIG_EDAC_HIGHBANK_MC=y +CONFIG_EDAC_HIGHBANK_L2=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_TWL4030=y +CONFIG_RTC_DRV_PL031=y +CONFIG_RTC_DRV_VT8500=y +CONFIG_RTC_DRV_TEGRA=y +CONFIG_DMADEVICES=y +CONFIG_DW_DMAC=y +CONFIG_TEGRA20_APB_DMA=y +CONFIG_STE_DMA40=y +CONFIG_SIRF_DMA=y +CONFIG_TI_EDMA=y +CONFIG_PL330_DMA=y +CONFIG_IMX_SDMA=y +CONFIG_IMX_DMA=y +CONFIG_MXS_DMA=y +CONFIG_DMA_OMAP=y +CONFIG_PWM=y +CONFIG_PWM_VT8500=y +CONFIG_DEBUG_KERNEL=y +CONFIG_LOCKUP_DETECTOR=y diff --git a/linaro/configs/netns.conf b/linaro/configs/netns.conf new file mode 100644 index 000000000000..7fadd7d67e36 --- /dev/null +++ b/linaro/configs/netns.conf @@ -0,0 +1,2 @@ +CONFIG_NAMESPACES=y +CONFIG_VETH=y diff --git a/linaro/configs/null.conf b/linaro/configs/null.conf new file mode 100644 index 000000000000..1f71535af70b --- /dev/null +++ b/linaro/configs/null.conf @@ -0,0 +1 @@ +### null config just for testing multiple config frags diff --git a/linaro/configs/omap2plus.conf b/linaro/configs/omap2plus.conf new file mode 100644 index 000000000000..669c5abf6454 --- /dev/null +++ b/linaro/configs/omap2plus.conf @@ -0,0 +1,205 @@ +CONFIG_EXPERT=y +CONFIG_KPROBES=y +CONFIG_MODULE_FORCE_LOAD=y +CONFIG_MODULE_FORCE_UNLOAD=y +CONFIG_MODVERSIONS=y +CONFIG_MODULE_SRCVERSION_ALL=y +# CONFIG_BLK_DEV_BSG is not set +CONFIG_PARTITION_ADVANCED=y +CONFIG_ARCH_MULTI_V6=y +CONFIG_OMAP_RESET_CLOCKS=y +CONFIG_OMAP_MUX_DEBUG=y +CONFIG_ARCH_OMAP2=y +CONFIG_ARCH_OMAP3=y +CONFIG_ARCH_OMAP4=y +CONFIG_SOC_OMAP5=y +CONFIG_SOC_AM33XX=y +CONFIG_SOC_DRA7XX=y +CONFIG_ARM_THUMBEE=y +CONFIG_ARM_ERRATA_411920=y +CONFIG_NR_CPUS=2 +CONFIG_CMA=y +CONFIG_ZBOOT_ROM_TEXT=0x0 +CONFIG_ZBOOT_ROM_BSS=0x0 +CONFIG_ARM_APPENDED_DTB=y +CONFIG_ARM_ATAG_DTB_COMPAT=y +CONFIG_CMDLINE="root=/dev/mmcblk0p2 rootwait console=ttyO2,115200" +CONFIG_KEXEC=y +CONFIG_FPE_NWFPE=y +CONFIG_PM_DEBUG=y +CONFIG_IPV6=y +CONFIG_CAN=m +CONFIG_CAN_C_CAN=m +CONFIG_CAN_C_CAN_PLATFORM=m +CONFIG_BT=m +CONFIG_BT_HCIUART=m +CONFIG_BT_HCIUART_H4=y +CONFIG_BT_HCIUART_BCSP=y +CONFIG_BT_HCIUART_LL=y +CONFIG_BT_HCIBCM203X=m +CONFIG_BT_HCIBPA10X=m +CONFIG_CFG80211=m +CONFIG_MAC80211=m +CONFIG_MAC80211_RC_PID=y +CONFIG_MAC80211_RC_DEFAULT_PID=y +CONFIG_DEVTMPFS=y +CONFIG_DEVTMPFS_MOUNT=y +CONFIG_DMA_CMA=y +CONFIG_MTD_NAND_OMAP2=y +CONFIG_MTD_ONENAND=y +CONFIG_MTD_ONENAND_VERIFY_WRITE=y +CONFIG_MTD_ONENAND_OMAP2=y +CONFIG_MTD_UBI=y +CONFIG_BLK_DEV_LOOP=y +CONFIG_BLK_DEV_RAM=y +CONFIG_BLK_DEV_RAM_SIZE=65536 +CONFIG_SENSORS_TSL2550=m +CONFIG_BMP085_I2C=m +CONFIG_SENSORS_LIS3_I2C=m +CONFIG_SCSI=y +CONFIG_BLK_DEV_SD=y +CONFIG_SCSI_MULTI_LUN=y +CONFIG_SCSI_SCAN_ASYNC=y +CONFIG_KS8851=y +CONFIG_KS8851_MLL=y +CONFIG_SMC91X=y +CONFIG_SMSC911X=y +CONFIG_TI_CPSW=y +CONFIG_AT803X_PHY=y +CONFIG_SMSC_PHY=y +CONFIG_USB_USBNET=y +CONFIG_USB_NET_SMSC95XX=y +CONFIG_USB_ALI_M5632=y +CONFIG_USB_AN2720=y +CONFIG_USB_EPSON2888=y +CONFIG_USB_KC2190=y +CONFIG_LIBERTAS=m +CONFIG_LIBERTAS_USB=m +CONFIG_LIBERTAS_SDIO=m +CONFIG_LIBERTAS_DEBUG=y +CONFIG_INPUT_JOYDEV=y +CONFIG_INPUT_EVDEV=y +CONFIG_KEYBOARD_GPIO=y +CONFIG_KEYBOARD_MATRIX=m +CONFIG_KEYBOARD_TWL4030=y +CONFIG_INPUT_TOUCHSCREEN=y +CONFIG_TOUCHSCREEN_ADS7846=y +CONFIG_INPUT_MISC=y +CONFIG_INPUT_TWL4030_PWRBUTTON=y +# CONFIG_LEGACY_PTYS is not set +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +CONFIG_SERIAL_8250_NR_UARTS=32 +CONFIG_SERIAL_8250_EXTENDED=y +CONFIG_SERIAL_8250_MANY_PORTS=y +CONFIG_SERIAL_8250_SHARE_IRQ=y +CONFIG_SERIAL_8250_DETECT_IRQ=y +CONFIG_SERIAL_8250_RSA=y +CONFIG_SERIAL_OMAP=y +CONFIG_SERIAL_OMAP_CONSOLE=y +CONFIG_HW_RANDOM=y +CONFIG_I2C_CHARDEV=y +CONFIG_SPI=y +CONFIG_SPI_OMAP24XX=y +CONFIG_PINCTRL_SINGLE=y +CONFIG_DEBUG_GPIO=y +CONFIG_GPIO_SYSFS=y +CONFIG_GPIO_TWL4030=y +CONFIG_W1=y +CONFIG_POWER_SUPPLY=y +CONFIG_SENSORS_LM75=m +CONFIG_THERMAL=y +CONFIG_THERMAL_GOV_FAIR_SHARE=y +CONFIG_THERMAL_GOV_USER_SPACE=y +CONFIG_TI_SOC_THERMAL=y +CONFIG_OMAP4_THERMAL=y +CONFIG_OMAP5_THERMAL=y +CONFIG_DRA752_THERMAL=y +CONFIG_WATCHDOG=y +CONFIG_OMAP_WATCHDOG=y +CONFIG_TWL4030_WATCHDOG=y +CONFIG_MFD_TPS65217=y +CONFIG_MFD_TPS65910=y +CONFIG_TWL6040_CORE=y +CONFIG_REGULATOR_TPS65023=y +CONFIG_REGULATOR_TPS6507X=y +CONFIG_REGULATOR_TPS65217=y +CONFIG_REGULATOR_TPS65910=y +CONFIG_REGULATOR_TWL4030=y +CONFIG_FB=y +CONFIG_FIRMWARE_EDID=y +CONFIG_FB_MODE_HELPERS=y +CONFIG_FB_TILEBLITTING=y +CONFIG_OMAP2_DSS=m +CONFIG_OMAP2_DSS_SDI=y +CONFIG_OMAP2_DSS_DSI=y +CONFIG_FB_OMAP2=m +CONFIG_DISPLAY_ENCODER_TFP410=m +CONFIG_DISPLAY_ENCODER_TPD12S015=m +CONFIG_DISPLAY_CONNECTOR_DVI=m +CONFIG_DISPLAY_CONNECTOR_HDMI=m +CONFIG_DISPLAY_PANEL_DPI=m +CONFIG_BACKLIGHT_LCD_SUPPORT=y +CONFIG_LCD_CLASS_DEVICE=y +CONFIG_LCD_PLATFORM=y +CONFIG_FRAMEBUFFER_CONSOLE=y +CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y +CONFIG_LOGO=y +CONFIG_SOUND=m +CONFIG_SND=m +CONFIG_SND_MIXER_OSS=m +CONFIG_SND_PCM_OSS=m +CONFIG_SND_VERBOSE_PRINTK=y +CONFIG_SND_DEBUG=y +CONFIG_SND_USB_AUDIO=m +CONFIG_SND_SOC=m +CONFIG_SND_OMAP_SOC=m +CONFIG_SND_OMAP_SOC_OMAP_TWL4030=m +CONFIG_SND_OMAP_SOC_OMAP_ABE_TWL6040=m +CONFIG_SND_OMAP_SOC_OMAP3_PANDORA=m +CONFIG_USB=y +CONFIG_USB_DEBUG=y +CONFIG_USB_ANNOUNCE_NEW_DEVICES=y +CONFIG_USB_MON=y +CONFIG_USB_WDM=y +CONFIG_USB_STORAGE=y +CONFIG_USB_TEST=y +CONFIG_NOP_USB_XCEIV=y +CONFIG_USB_GADGET=y +CONFIG_USB_GADGET_DEBUG=y +CONFIG_USB_GADGET_DEBUG_FILES=y +CONFIG_USB_GADGET_DEBUG_FS=y +CONFIG_USB_ZERO=m +CONFIG_MMC=y +CONFIG_MMC_UNSAFE_RESUME=y +CONFIG_SDIO_UART=y +CONFIG_MMC_OMAP=y +CONFIG_MMC_OMAP_HS=y +CONFIG_NEW_LEDS=y +CONFIG_LEDS_GPIO=y +CONFIG_LEDS_TRIGGERS=y +CONFIG_LEDS_TRIGGER_TIMER=y +CONFIG_LEDS_TRIGGER_ONESHOT=y +CONFIG_LEDS_TRIGGER_HEARTBEAT=y +CONFIG_LEDS_TRIGGER_BACKLIGHT=y +CONFIG_LEDS_TRIGGER_CPU=y +CONFIG_LEDS_TRIGGER_GPIO=y +CONFIG_LEDS_TRIGGER_DEFAULT_ON=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_TWL92330=y +CONFIG_RTC_DRV_TWL4030=y +CONFIG_RTC_DRV_OMAP=y +CONFIG_DMADEVICES=y +CONFIG_TI_EDMA=y +CONFIG_DMA_OMAP=y +# CONFIG_EXT3_FS_XATTR is not set +CONFIG_UBIFS_FS=y +CONFIG_DEBUG_INFO=y +CONFIG_PROVE_LOCKING=y +# CONFIG_DEBUG_BUGVERBOSE is not set +CONFIG_SECURITY=y +# CONFIG_CRYPTO_ANSI_CPRNG is not set +CONFIG_LIBCRC32C=y +CONFIG_FONTS=y +CONFIG_FONT_8x8=y +CONFIG_FONT_8x16=y diff --git a/linaro/configs/omap4.conf b/linaro/configs/omap4.conf new file mode 100644 index 000000000000..50fb9d9cb5b5 --- /dev/null +++ b/linaro/configs/omap4.conf @@ -0,0 +1,194 @@ +CONFIG_EXPERT=y +CONFIG_KPROBES=y +CONFIG_MODULE_FORCE_LOAD=y +CONFIG_MODULE_FORCE_UNLOAD=y +CONFIG_MODVERSIONS=y +CONFIG_MODULE_SRCVERSION_ALL=y +# CONFIG_BLK_DEV_BSG is not set +CONFIG_PARTITION_ADVANCED=y +CONFIG_GPIO_PCA953X=y +CONFIG_OMAP_RESET_CLOCKS=y +CONFIG_OMAP_MUX_DEBUG=y +CONFIG_ARCH_OMAP2PLUS=y +CONFIG_SOC_OMAP5=y +# CONFIG_ARCH_OMAP2 is not set +CONFIG_ARCH_VEXPRESS_CA9X4=y +CONFIG_ARM_THUMBEE=y +CONFIG_ARM_ERRATA_411920=y +CONFIG_NR_CPUS=2 +CONFIG_ZBOOT_ROM_TEXT=0x0 +CONFIG_ZBOOT_ROM_BSS=0x0 +CONFIG_CMDLINE="root=/dev/mmcblk0p2 rootwait console=ttyO2,115200" +CONFIG_KEXEC=y +CONFIG_PM_DEBUG=y +CONFIG_CAN=m +CONFIG_CAN_C_CAN=m +CONFIG_CAN_C_CAN_PLATFORM=m +CONFIG_BT=m +CONFIG_BT_HCIUART=m +CONFIG_BT_HCIUART_H4=y +CONFIG_BT_HCIUART_BCSP=y +CONFIG_BT_HCIUART_LL=y +CONFIG_BT_HCIBCM203X=m +CONFIG_BT_HCIBPA10X=m +CONFIG_CFG80211=m +CONFIG_MAC80211=m +CONFIG_MAC80211_RC_PID=y +CONFIG_MAC80211_RC_DEFAULT_PID=y +CONFIG_CMA=y +CONFIG_MTD_NAND_OMAP2=y +CONFIG_MTD_ONENAND=y +CONFIG_MTD_ONENAND_VERIFY_WRITE=y +CONFIG_MTD_ONENAND_OMAP2=y +CONFIG_MTD_UBI=y +CONFIG_BLK_DEV_LOOP=y +CONFIG_BLK_DEV_RAM_SIZE=16384 +CONFIG_SENSORS_TSL2550=m +CONFIG_SENSORS_LIS3_I2C=m +CONFIG_SCSI=y +CONFIG_BLK_DEV_SD=y +CONFIG_SCSI_MULTI_LUN=y +CONFIG_SCSI_SCAN_ASYNC=y +CONFIG_KS8851=y +CONFIG_KS8851_MLL=y +CONFIG_SMC91X=y +CONFIG_SMSC911X=y +CONFIG_TI_CPSW=y +CONFIG_SMSC_PHY=y +CONFIG_USB_USBNET=y +CONFIG_USB_NET_SMSC95XX=y +CONFIG_USB_ALI_M5632=y +CONFIG_USB_AN2720=y +CONFIG_USB_EPSON2888=y +CONFIG_USB_KC2190=y +CONFIG_LIBERTAS=m +CONFIG_LIBERTAS_USB=m +CONFIG_LIBERTAS_SDIO=m +CONFIG_LIBERTAS_DEBUG=y +CONFIG_INPUT_JOYDEV=y +CONFIG_INPUT_EVDEV=y +CONFIG_KEYBOARD_GPIO=y +CONFIG_KEYBOARD_MATRIX=m +CONFIG_KEYBOARD_TWL4030=y +CONFIG_INPUT_TOUCHSCREEN=y +CONFIG_TOUCHSCREEN_ADS7846=y +CONFIG_INPUT_TWL4030_PWRBUTTON=y +CONFIG_VT_HW_CONSOLE_BINDING=y +# CONFIG_LEGACY_PTYS is not set +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_8250_CONSOLE=y +CONFIG_SERIAL_8250_NR_UARTS=32 +CONFIG_SERIAL_8250_EXTENDED=y +CONFIG_SERIAL_8250_MANY_PORTS=y +CONFIG_SERIAL_8250_SHARE_IRQ=y +CONFIG_SERIAL_8250_DETECT_IRQ=y +CONFIG_SERIAL_8250_RSA=y +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +CONFIG_SERIAL_OMAP=y +CONFIG_SERIAL_OMAP_CONSOLE=y +CONFIG_HW_RANDOM=y +CONFIG_I2C_CHARDEV=y +CONFIG_SPI=y +CONFIG_SPI_OMAP24XX=y +CONFIG_PINCTRL_SINGLE=y +CONFIG_DEBUG_GPIO=y +CONFIG_GPIO_SYSFS=y +CONFIG_GPIO_TWL4030=y +CONFIG_W1=y +CONFIG_SENSORS_LM75=m +CONFIG_WATCHDOG=y +CONFIG_OMAP_WATCHDOG=y +CONFIG_TWL4030_WATCHDOG=y +CONFIG_MFD_TPS65217=y +CONFIG_MFD_TPS65910=y +CONFIG_TWL6040_CORE=y +CONFIG_REGULATOR_TPS65023=y +CONFIG_REGULATOR_TPS6507X=y +CONFIG_REGULATOR_TPS65217=y +CONFIG_REGULATOR_TPS65910=y +CONFIG_REGULATOR_TWL4030=y +CONFIG_FB=y +CONFIG_FIRMWARE_EDID=y +CONFIG_FB_MODE_HELPERS=y +CONFIG_FB_TILEBLITTING=y +CONFIG_OMAP2_DSS=m +CONFIG_OMAP2_DSS_RFBI=y +CONFIG_OMAP2_DSS_SDI=y +CONFIG_OMAP2_DSS_DSI=y +CONFIG_FB_OMAP2=m +CONFIG_PANEL_GENERIC_DPI=m +CONFIG_PANEL_TFP410=m +CONFIG_PANEL_SHARP_LS037V7DW01=m +CONFIG_PANEL_NEC_NL8048HL11_01B=m +CONFIG_PANEL_TAAL=m +CONFIG_PANEL_TPO_TD043MTEA1=m +CONFIG_PANEL_ACX565AKM=m +CONFIG_BACKLIGHT_LCD_SUPPORT=y +CONFIG_LCD_CLASS_DEVICE=y +CONFIG_LCD_PLATFORM=y +CONFIG_FRAMEBUFFER_CONSOLE_ROTATION=y +CONFIG_FONTS=y +CONFIG_FONT_8x8=y +CONFIG_FONT_8x16=y +CONFIG_LOGO=y +CONFIG_SOUND=m +CONFIG_SND=m +CONFIG_SND_VERBOSE_PRINTK=y +CONFIG_SND_DEBUG=y +CONFIG_SND_USB_AUDIO=m +CONFIG_SND_SOC=m +CONFIG_SND_OMAP_SOC=m +CONFIG_SND_OMAP_SOC_OMAP_TWL4030=m +CONFIG_SND_OMAP_SOC_OMAP_ABE_TWL6040=m +CONFIG_SND_OMAP_SOC_OMAP3_PANDORA=m +CONFIG_USB=y +CONFIG_USB_DEBUG=y +CONFIG_USB_ANNOUNCE_NEW_DEVICES=y +CONFIG_USB_MON=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_OHCI_HCD=y +CONFIG_USB_WDM=y +CONFIG_USB_STORAGE=y +CONFIG_USB_TEST=y +CONFIG_USB_PHY=y +CONFIG_NOP_USB_XCEIV=y +CONFIG_USB_GADGET=y +CONFIG_USB_GADGET_DEBUG=y +CONFIG_USB_GADGET_DEBUG_FILES=y +CONFIG_USB_GADGET_DEBUG_FS=y +CONFIG_USB_ZERO=m +CONFIG_MMC=y +CONFIG_MMC_UNSAFE_RESUME=y +CONFIG_SDIO_UART=y +CONFIG_MMC_ARMMMCI=y +CONFIG_MMC_OMAP=y +CONFIG_MMC_OMAP_HS=y +CONFIG_NEW_LEDS=y +CONFIG_LEDS_CLASS=y +CONFIG_LEDS_GPIO=y +CONFIG_LEDS_TRIGGERS=y +CONFIG_LEDS_TRIGGER_TIMER=y +CONFIG_LEDS_TRIGGER_ONESHOT=y +CONFIG_LEDS_TRIGGER_HEARTBEAT=y +CONFIG_LEDS_TRIGGER_BACKLIGHT=y +CONFIG_LEDS_TRIGGER_CPU=y +CONFIG_LEDS_TRIGGER_GPIO=y +CONFIG_LEDS_TRIGGER_DEFAULT_ON=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_TWL92330=y +CONFIG_RTC_DRV_TWL4030=y +CONFIG_RTC_DRV_OMAP=y +CONFIG_DMADEVICES=y +CONFIG_DMA_OMAP=y +# CONFIG_EXT3_FS_XATTR is not set +CONFIG_UBIFS_FS=y +CONFIG_NFS_FS=y +CONFIG_NFS_V3_ACL=y +CONFIG_NFS_V4=y +CONFIG_ROOT_NFS=y +# CONFIG_DEBUG_BUGVERBOSE is not set +CONFIG_DEBUG_INFO=y +# CONFIG_CRYPTO_ANSI_CPRNG is not set +CONFIG_LIBCRC32C=y +# CONFIG_CPU_FREQ is not set diff --git a/linaro/configs/origen.conf b/linaro/configs/origen.conf new file mode 100644 index 000000000000..9546585f7266 --- /dev/null +++ b/linaro/configs/origen.conf @@ -0,0 +1,89 @@ +CONFIG_ARCH_EXYNOS=y +CONFIG_S3C_LOWLEVEL_UART_PORT=2 +CONFIG_S3C24XX_PWM=y +CONFIG_MACH_SMDKC210=y +CONFIG_MACH_ARMLEX4210=y +CONFIG_MACH_UNIVERSAL_C210=y +CONFIG_MACH_NURI=y +CONFIG_MACH_ORIGEN=y +CONFIG_MACH_SMDK4412=y +CONFIG_MACH_EXYNOS4_DT=y +CONFIG_NR_CPUS=2 +# CONFIG_THUMB2_KERNEL is not set +CONFIG_AEABI=y +CONFIG_CMDLINE="root=/dev/mmcblk0p1 rw rootwait console=ttySAC2,115200" +CONFIG_CPU_FREQ=y +CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y +CONFIG_CPU_FREQ_GOV_POWERSAVE=y +CONFIG_CPU_FREQ_GOV_USERSPACE=y +CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y +CONFIG_CPU_IDLE=y +CONFIG_VFP=y +CONFIG_NEON=y +CONFIG_PM_RUNTIME=y +CONFIG_RFKILL=y +CONFIG_RFKILL_GPIO=y +CONFIG_CMA=y +CONFIG_CMA_SIZE_MBYTES=32 +CONFIG_SCSI=y +CONFIG_BLK_DEV_SD=y +CONFIG_CHR_DEV_SG=y +CONFIG_CFG80211=y +CONFIG_USB_PEGASUS=y +CONFIG_USB_USBNET=y +CONFIG_USB_NET_DM9601=y +CONFIG_USB_NET_MCS7830=y +CONFIG_INPUT_EVDEV=y +CONFIG_KEYBOARD_GPIO=y +CONFIG_INPUT_TOUCHSCREEN=y +CONFIG_VT_HW_CONSOLE_BINDING=y +CONFIG_SERIAL_8250=y +CONFIG_SERIAL_SAMSUNG=y +CONFIG_SERIAL_SAMSUNG_CONSOLE=y +CONFIG_HW_RANDOM=y +CONFIG_I2C=y +CONFIG_I2C_S3C2410=y +CONFIG_THERMAL=y +CONFIG_CPU_THERMAL=y +CONFIG_SENSORS_EXYNOS4_TMU=y +CONFIG_POWER_SUPPLY=y +CONFIG_MFD_MAX8997=y +CONFIG_REGULATOR=y +CONFIG_REGULATOR_FIXED_VOLTAGE=y +CONFIG_REGULATOR_MAX8997=y +CONFIG_MEDIA_SUPPORT=y +CONFIG_MEDIA_CAMERA_SUPPORT=y +CONFIG_MEDIA_CONTROLLER=y +CONFIG_VIDEO_V4L2_SUBDEV_API=y +CONFIG_V4L_PLATFORM_DRIVERS=y +CONFIG_VIDEO_SAMSUNG_S5P_FIMC=y +CONFIG_VIDEO_S5P_FIMC=y +CONFIG_VIDEO_SAMSUNG_S5P_TV=y +CONFIG_VIDEO_SAMSUNG_S5P_HDMI=y +CONFIG_VIDEO_SAMSUNG_S5P_SDO=y +CONFIG_VIDEO_SAMSUNG_S5P_MIXER=y +CONFIG_V4L_MEM2MEM_DRIVERS=y +CONFIG_VIDEO_SAMSUNG_S5P_G2D=y +CONFIG_VIDEO_SAMSUNG_S5P_JPEG=y +CONFIG_VIDEO_SAMSUNG_S5P_MFC=y +CONFIG_FB=y +CONFIG_FB_S3C=y +CONFIG_BACKLIGHT_LCD_SUPPORT=y +CONFIG_LCD_CLASS_DEVICE=y +CONFIG_LCD_PLATFORM=y +CONFIG_BACKLIGHT_CLASS_DEVICE=y +CONFIG_BACKLIGHT_PWM=y +CONFIG_LOGO=y +CONFIG_USB=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_EHCI_S5P=y +CONFIG_USB_OHCI_HCD=y +CONFIG_USB_OHCI_EXYNOS=y +CONFIG_MMC=y +CONFIG_MMC_UNSAFE_RESUME=y +CONFIG_MMC_SDHCI=y +CONFIG_MMC_SDHCI_S3C=y +CONFIG_MMC_SDHCI_S3C_DMA=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_S3C=y +CONFIG_DEBUG_S3C_UART2=y diff --git a/linaro/configs/panda.conf b/linaro/configs/panda.conf new file mode 100644 index 000000000000..1771853ceeae --- /dev/null +++ b/linaro/configs/panda.conf @@ -0,0 +1,11 @@ +# CONFIG_ARCH_OMAP2 is not set +CONFIG_GPIO_PCA953X=y +CONFIG_REGULATOR_DUMMY=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_OHCI_HCD=y +CONFIG_LEDS_CLASS=y +CONFIG_WL_TI=y +CONFIG_WL12XX=m +CONFIG_WLCORE=m +CONFIG_WLCORE_SDIO=m +CONFIG_WILINK_PLATFORM_DATA=y diff --git a/linaro/configs/preempt-rt.conf b/linaro/configs/preempt-rt.conf new file mode 100644 index 000000000000..7c6594f3a016 --- /dev/null +++ b/linaro/configs/preempt-rt.conf @@ -0,0 +1,3 @@ +CONFIG_PREEMPT=y +CONFIG_PREEMPT_RT_FULL=y +# CONFIG_CPU_FREQ is not set diff --git a/linaro/configs/u8500.conf b/linaro/configs/u8500.conf new file mode 100644 index 000000000000..86e294faa2f6 --- /dev/null +++ b/linaro/configs/u8500.conf @@ -0,0 +1,83 @@ +# CONFIG_SWAP is not set +CONFIG_KALLSYMS_ALL=y +# CONFIG_BLK_DEV_BSG is not set +CONFIG_ARCH_U8500=y +CONFIG_MACH_HREFV60=y +CONFIG_MACH_SNOWBALL=y +CONFIG_MACH_UX500_DT=y +CONFIG_NR_CPUS=2 +CONFIG_PREEMPT=y +CONFIG_CMDLINE="root=/dev/ram0 console=ttyAMA2,115200n8" +CONFIG_VFP=y +CONFIG_NEON=y +CONFIG_PM_RUNTIME=y +CONFIG_INET6_XFRM_MODE_TRANSPORT=m +CONFIG_INET6_XFRM_MODE_TUNNEL=m +CONFIG_INET6_XFRM_MODE_BEET=m +CONFIG_IPV6_SIT=m +CONFIG_PHONET=y +# CONFIG_WIRELESS is not set +CONFIG_CAIF=y +CONFIG_SENSORS_BH1780=y +CONFIG_SMSC911X=y +CONFIG_SMSC_PHY=y +# CONFIG_WLAN is not set +# CONFIG_INPUT_MOUSEDEV_PSAUX is not set +CONFIG_INPUT_EVDEV=y +# CONFIG_KEYBOARD_ATKBD is not set +CONFIG_KEYBOARD_GPIO=y +CONFIG_KEYBOARD_NOMADIK=y +CONFIG_KEYBOARD_STMPE=y +CONFIG_KEYBOARD_TC3589X=y +# CONFIG_INPUT_MOUSE is not set +CONFIG_INPUT_TOUCHSCREEN=y +CONFIG_TOUCHSCREEN_BU21013=y +CONFIG_INPUT_AB8500_PONKEY=y +# CONFIG_SERIO is not set +CONFIG_VT_HW_CONSOLE_BINDING=y +# CONFIG_LEGACY_PTYS is not set +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +CONFIG_HW_RANDOM=y +CONFIG_SPI=y +CONFIG_SPI_PL022=y +CONFIG_GPIO_STMPE=y +CONFIG_GPIO_TC3589X=y +CONFIG_THERMAL=y +CONFIG_CPU_THERMAL=y +CONFIG_MFD_STMPE=y +CONFIG_MFD_TC3589X=y +CONFIG_AB8500_CORE=y +CONFIG_REGULATOR_GPIO=y +CONFIG_REGULATOR_AB8500=y +CONFIG_USB_GADGET=y +CONFIG_AB8500_USB=y +CONFIG_MMC=y +CONFIG_MMC_CLKGATE=y +CONFIG_MMC_ARMMMCI=y +CONFIG_NEW_LEDS=y +CONFIG_LEDS_CLASS=y +CONFIG_LEDS_LM3530=y +CONFIG_LEDS_GPIO=y +CONFIG_LEDS_LP5521=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_AB8500=y +CONFIG_RTC_DRV_PL031=y +CONFIG_DMADEVICES=y +CONFIG_STE_DMA40=y +CONFIG_STAGING=y +CONFIG_TOUCHSCREEN_SYNAPTICS_I2C_RMI4=y +CONFIG_HSEM_U8500=y +CONFIG_EXT2_FS_XATTR=y +CONFIG_EXT2_FS_POSIX_ACL=y +CONFIG_EXT2_FS_SECURITY=y +CONFIG_CONFIGFS_FS=m +# CONFIG_MISC_FILESYSTEMS is not set +CONFIG_NFS_FS=y +CONFIG_ROOT_NFS=y +CONFIG_DEBUG_KERNEL=y +# CONFIG_SCHED_DEBUG is not set +# CONFIG_DEBUG_PREEMPT is not set +CONFIG_DEBUG_INFO=y +# CONFIG_FTRACE is not set +CONFIG_DEBUG_USER=y diff --git a/linaro/configs/uprobes.conf b/linaro/configs/uprobes.conf new file mode 100644 index 000000000000..a9617986284a --- /dev/null +++ b/linaro/configs/uprobes.conf @@ -0,0 +1,7 @@ +CONFIG_NAMESPACES=y +CONFIG_USER_NS=y +CONFIG_RELAY=y +CONFIG_KPROBES=y +CONFIG_UPROBE_EVENT=y +CONFIG_KPROBES_SANITY_TEST=y +CONFIG_ARM_KPROBES_TEST=m diff --git a/linaro/configs/vexpress.conf b/linaro/configs/vexpress.conf new file mode 100644 index 000000000000..8370cf15be65 --- /dev/null +++ b/linaro/configs/vexpress.conf @@ -0,0 +1,64 @@ +CONFIG_ARCH_VEXPRESS=y +CONFIG_HAVE_ARM_ARCH_TIMER=y +CONFIG_NR_CPUS=8 +CONFIG_HIGHMEM=y +CONFIG_HIGHPTE=y +CONFIG_ARM_PSCI=y +CONFIG_MCPM=y +CONFIG_ARCH_VEXPRESS_DCSCB=y +CONFIG_ARCH_VEXPRESS_TC2_PM=y +CONFIG_ARM_BIG_LITTLE_CPUIDLE=y +CONFIG_BIG_LITTLE=y +CONFIG_ARM_BIG_LITTLE_CPUFREQ=y +CONFIG_ARM_VEXPRESS_SPC_CPUFREQ=y +CONFIG_PM_OPP=y +CONFIG_CPU_FREQ=y +CONFIG_CPU_FREQ_GOV_ONDEMAND=y +CONFIG_CPU_FREQ_GOV_PERFORMANCE=y +CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y +CONFIG_CMDLINE="console=ttyAMA0,38400n8 root=/dev/mmcblk0p2 rootwait mmci.fmax=4000000" +CONFIG_VFP=y +CONFIG_NEON=y +CONFIG_SCSI=y +CONFIG_BLK_DEV_SD=y +CONFIG_SMSC911X=y +CONFIG_SMC91X=y +CONFIG_INPUT_EVDEV=y +CONFIG_SERIO_AMBAKMI=y +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +CONFIG_FB=y +CONFIG_FB_ARMCLCD=y +CONFIG_FB_ARMHDLCD=y +CONFIG_LOGO=y +# CONFIG_LOGO_LINUX_MONO is not set +# CONFIG_LOGO_LINUX_VGA16 is not set +CONFIG_SOUND=y +CONFIG_SND=y +CONFIG_SND_ARMAACI=y +CONFIG_USB=y +CONFIG_USB_ISP1760=y +CONFIG_USB_STORAGE=y +CONFIG_MMC=y +CONFIG_MMC_ARMMMCI=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_PL031=y +CONFIG_NFS_FS=y +CONFIG_NFS_V3=y +CONFIG_NFS_V3_ACL=y +CONFIG_NFS_V4=y +CONFIG_ROOT_NFS=y +CONFIG_VEXPRESS_CONFIG=y +CONFIG_SENSORS_VEXPRESS=y +CONFIG_REGULATOR=y +CONFIG_REGULATOR_VEXPRESS=y +CONFIG_NEW_LEDS=y +CONFIG_LEDS_CLASS=y +CONFIG_LEDS_GPIO=y +CONFIG_LEDS_TRIGGERS=y +CONFIG_LEDS_TRIGGER_HEARTBEAT=y +CONFIG_LEDS_TRIGGER_CPU=y +CONFIG_VIRTIO=y +CONFIG_VIRTIO_BLK=y +CONFIG_VIRTIO_MMIO=y +CONFIG_VIRTIO_MMIO_CMDLINE_DEVICES=y diff --git a/linaro/configs/vexpress64.conf b/linaro/configs/vexpress64.conf new file mode 100644 index 000000000000..f88ca4ac062c --- /dev/null +++ b/linaro/configs/vexpress64.conf @@ -0,0 +1,63 @@ +CONFIG_ARCH_VEXPRESS=y +CONFIG_SMP=y +CONFIG_NR_CPUS=8 +CONFIG_HOTPLUG_CPU=y +CONFIG_PREEMPT=y +CONFIG_CMDLINE="console=ttyAMA0" +CONFIG_COMPAT=y +CONFIG_SMC91X=y +CONFIG_INPUT_EVDEV=y +CONFIG_SERIO_AMBAKMI=y +CONFIG_SERIAL_AMBA_PL011=y +CONFIG_SERIAL_AMBA_PL011_CONSOLE=y +# CONFIG_SERIO_I8042 is not set +CONFIG_FB=y +CONFIG_FB_ARMCLCD=y +CONFIG_FRAMEBUFFER_CONSOLE=y +# CONFIG_VGA_CONSOLE is not set +CONFIG_LOGO=y +# CONFIG_LOGO_LINUX_MONO is not set +# CONFIG_LOGO_LINUX_VGA16 is not set +CONFIG_MMC=y +CONFIG_MMC_ARMMMCI=y +CONFIG_RTC_CLASS=y +CONFIG_RTC_DRV_PL031=y +CONFIG_NFS_FS=y +CONFIG_NFS_V3=y +CONFIG_NFS_V3_ACL=y +CONFIG_NFS_V4=y +CONFIG_ROOT_NFS=y +CONFIG_VIRTIO=y +CONFIG_VIRTIO_BLK=y +CONFIG_VIRTIO_MMIO=y +CONFIG_REGULATOR=y +CONFIG_REGULATOR_FIXED_VOLTAGE=y +# CONFIG_DEBUG_PREEMPT is not set +CONFIG_SMSC911X=y +CONFIG_I2C=y +CONFIG_I2C_DESIGNWARE_PLATFORM=y +CONFIG_USB_HIDDEV=y +CONFIG_SCSI=y +CONFIG_BLK_DEV_SD=y +CONFIG_USB_STORAGE=y +CONFIG_USB=y +CONFIG_USB_ULPI=y +CONFIG_USB_EHCI_HCD=y +CONFIG_USB_EHCI_HCD_PLATFORM=y +CONFIG_NOP_USB_XCEIV=y +CONFIG_USB_OHCI_HCD=y +CONFIG_PCI=y +CONFIG_PCI_MSI=y +CONFIG_PCI_REALLOC_ENABLE_AUTO=y +CONFIG_PCI_PRI=y +CONFIG_PCI_PASID=y +CONFIG_PCI_HOST_GENERIC=y +CONFIG_PCIEPORTBUS=y +CONFIG_HOTPLUG_PCI=y +CONFIG_HOTPLUG_PCI_PCIE=y +CONFIG_PCIEAER=y +CONFIG_PCIE_ECRC=y +CONFIG_CONNECTOR=y +CONFIG_ATA=y +CONFIG_SATA_SIL24=y +CONFIG_SKY2=y diff --git a/linaro/configs/xen.conf b/linaro/configs/xen.conf new file mode 100644 index 000000000000..d24fabbea076 --- /dev/null +++ b/linaro/configs/xen.conf @@ -0,0 +1,7 @@ +CONFIG_XEN=y +CONFIG_XEN_NETDEV_FRONTEND=y +CONFIG_XEN_NETDEV_BACKEND=y +CONFIG_XEN_BLKDEV_FRONTEND=y +CONFIG_XEN_BLKDEV_BACKEND=y +CONFIG_XENFS=y +CONFIG_XEN_COMPAT_XENFS=y diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 000e7b3b9896..e1bd6e864152 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -18,6 +18,7 @@ struct backing_dev_info noop_backing_dev_info = { .name = "noop", .capabilities = BDI_CAP_NO_ACCT_AND_WRITEBACK, }; +EXPORT_SYMBOL_GPL(noop_backing_dev_info); static struct class *bdi_class; @@ -48,7 +49,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) struct bdi_writeback *wb = &bdi->wb; unsigned long background_thresh; unsigned long dirty_thresh; - unsigned long bdi_thresh; + unsigned long wb_thresh; unsigned long nr_dirty, nr_io, nr_more_io, nr_dirty_time; struct inode *inode; @@ -66,7 +67,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) spin_unlock(&wb->list_lock); global_dirty_limits(&background_thresh, &dirty_thresh); - bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); + wb_thresh = wb_calc_thresh(wb, dirty_thresh); #define K(x) ((x) << (PAGE_SHIFT - 10)) seq_printf(m, @@ -84,19 +85,19 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v) "b_dirty_time: %10lu\n" "bdi_list: %10u\n" "state: %10lx\n", - (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), - (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)), - K(bdi_thresh), + (unsigned long) K(wb_stat(wb, WB_WRITEBACK)), + (unsigned long) K(wb_stat(wb, WB_RECLAIMABLE)), + K(wb_thresh), K(dirty_thresh), K(background_thresh), - (unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)), - (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)), - (unsigned long) K(bdi->write_bandwidth), + (unsigned long) K(wb_stat(wb, WB_DIRTIED)), + (unsigned long) K(wb_stat(wb, WB_WRITTEN)), + (unsigned long) K(wb->write_bandwidth), nr_dirty, nr_io, nr_more_io, nr_dirty_time, - !list_empty(&bdi->bdi_list), bdi->state); + !list_empty(&bdi->bdi_list), bdi->wb.state); #undef K return 0; @@ -255,13 +256,8 @@ static int __init default_bdi_init(void) } subsys_initcall(default_bdi_init); -int bdi_has_dirty_io(struct backing_dev_info *bdi) -{ - return wb_has_dirty_io(&bdi->wb); -} - /* - * This function is used when the first inode for this bdi is marked dirty. It + * This function is used when the first inode for this wb is marked dirty. It * wakes-up the corresponding bdi thread which should then take care of the * periodic background write-out of dirty inodes. Since the write-out would * starts only 'dirty_writeback_interval' centisecs from now anyway, we just @@ -274,162 +270,582 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi) * We have to be careful not to postpone flush work if it is scheduled for * earlier. Thus we use queue_delayed_work(). */ -void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi) +void wb_wakeup_delayed(struct bdi_writeback *wb) { unsigned long timeout; timeout = msecs_to_jiffies(dirty_writeback_interval * 10); - spin_lock_bh(&bdi->wb_lock); - if (test_bit(BDI_registered, &bdi->state)) - queue_delayed_work(bdi_wq, &bdi->wb.dwork, timeout); - spin_unlock_bh(&bdi->wb_lock); + spin_lock_bh(&wb->work_lock); + if (test_bit(WB_registered, &wb->state)) + queue_delayed_work(bdi_wq, &wb->dwork, timeout); + spin_unlock_bh(&wb->work_lock); } /* - * Remove bdi from bdi_list, and ensure that it is no longer visible + * Initial write bandwidth: 100 MB/s */ -static void bdi_remove_from_list(struct backing_dev_info *bdi) +#define INIT_BW (100 << (20 - PAGE_SHIFT)) + +static int wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi, + int blkcg_id, gfp_t gfp) { - spin_lock_bh(&bdi_lock); - list_del_rcu(&bdi->bdi_list); - spin_unlock_bh(&bdi_lock); + int i, err; - synchronize_rcu_expedited(); -} + memset(wb, 0, sizeof(*wb)); -int bdi_register(struct backing_dev_info *bdi, struct device *parent, - const char *fmt, ...) -{ - va_list args; - struct device *dev; + wb->bdi = bdi; + wb->last_old_flush = jiffies; + INIT_LIST_HEAD(&wb->b_dirty); + INIT_LIST_HEAD(&wb->b_io); + INIT_LIST_HEAD(&wb->b_more_io); + INIT_LIST_HEAD(&wb->b_dirty_time); + spin_lock_init(&wb->list_lock); - if (bdi->dev) /* The driver needs to use separate queues per device */ - return 0; + wb->bw_time_stamp = jiffies; + wb->balanced_dirty_ratelimit = INIT_BW; + wb->dirty_ratelimit = INIT_BW; + wb->write_bandwidth = INIT_BW; + wb->avg_write_bandwidth = INIT_BW; - va_start(args, fmt); - dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args); - va_end(args); - if (IS_ERR(dev)) - return PTR_ERR(dev); + spin_lock_init(&wb->work_lock); + INIT_LIST_HEAD(&wb->work_list); + INIT_DELAYED_WORK(&wb->dwork, wb_workfn); - bdi->dev = dev; + wb->congested = wb_congested_get_create(bdi, blkcg_id, gfp); + if (!wb->congested) + return -ENOMEM; - bdi_debug_register(bdi, dev_name(dev)); - set_bit(BDI_registered, &bdi->state); + err = fprop_local_init_percpu(&wb->completions, gfp); + if (err) + goto out_put_cong; - spin_lock_bh(&bdi_lock); - list_add_tail_rcu(&bdi->bdi_list, &bdi_list); - spin_unlock_bh(&bdi_lock); + for (i = 0; i < NR_WB_STAT_ITEMS; i++) { + err = percpu_counter_init(&wb->stat[i], 0, gfp); + if (err) + goto out_destroy_stat; + } - trace_writeback_bdi_register(bdi); return 0; -} -EXPORT_SYMBOL(bdi_register); -int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev) -{ - return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev)); +out_destroy_stat: + while (--i) + percpu_counter_destroy(&wb->stat[i]); + fprop_local_destroy_percpu(&wb->completions); +out_put_cong: + wb_congested_put(wb->congested); + return err; } -EXPORT_SYMBOL(bdi_register_dev); /* * Remove bdi from the global list and shutdown any threads we have running */ -static void bdi_wb_shutdown(struct backing_dev_info *bdi) +static void wb_shutdown(struct bdi_writeback *wb) { /* Make sure nobody queues further work */ - spin_lock_bh(&bdi->wb_lock); - if (!test_and_clear_bit(BDI_registered, &bdi->state)) { - spin_unlock_bh(&bdi->wb_lock); + spin_lock_bh(&wb->work_lock); + if (!test_and_clear_bit(WB_registered, &wb->state)) { + spin_unlock_bh(&wb->work_lock); return; } - spin_unlock_bh(&bdi->wb_lock); + spin_unlock_bh(&wb->work_lock); /* - * Make sure nobody finds us on the bdi_list anymore + * Drain work list and shutdown the delayed_work. !WB_registered + * tells wb_workfn() that @wb is dying and its work_list needs to + * be drained no matter what. */ - bdi_remove_from_list(bdi); + mod_delayed_work(bdi_wq, &wb->dwork, 0); + flush_delayed_work(&wb->dwork); + WARN_ON(!list_empty(&wb->work_list)); +} + +static void wb_exit(struct bdi_writeback *wb) +{ + int i; + + WARN_ON(delayed_work_pending(&wb->dwork)); + + for (i = 0; i < NR_WB_STAT_ITEMS; i++) + percpu_counter_destroy(&wb->stat[i]); + + fprop_local_destroy_percpu(&wb->completions); + wb_congested_put(wb->congested); +} + +#ifdef CONFIG_CGROUP_WRITEBACK + +#include <linux/memcontrol.h> + +/* + * cgwb_lock protects bdi->cgwb_tree, bdi->cgwb_congested_tree, + * blkcg->cgwb_list, and memcg->cgwb_list. bdi->cgwb_tree is also RCU + * protected. cgwb_release_wait is used to wait for the completion of cgwb + * releases from bdi destruction path. + */ +static DEFINE_SPINLOCK(cgwb_lock); +static DECLARE_WAIT_QUEUE_HEAD(cgwb_release_wait); + +/** + * wb_congested_get_create - get or create a wb_congested + * @bdi: associated bdi + * @blkcg_id: ID of the associated blkcg + * @gfp: allocation mask + * + * Look up the wb_congested for @blkcg_id on @bdi. If missing, create one. + * The returned wb_congested has its reference count incremented. Returns + * NULL on failure. + */ +struct bdi_writeback_congested * +wb_congested_get_create(struct backing_dev_info *bdi, int blkcg_id, gfp_t gfp) +{ + struct bdi_writeback_congested *new_congested = NULL, *congested; + struct rb_node **node, *parent; + unsigned long flags; +retry: + spin_lock_irqsave(&cgwb_lock, flags); + + node = &bdi->cgwb_congested_tree.rb_node; + parent = NULL; + + while (*node != NULL) { + parent = *node; + congested = container_of(parent, struct bdi_writeback_congested, + rb_node); + if (congested->blkcg_id < blkcg_id) + node = &parent->rb_left; + else if (congested->blkcg_id > blkcg_id) + node = &parent->rb_right; + else + goto found; + } + + if (new_congested) { + /* !found and storage for new one already allocated, insert */ + congested = new_congested; + new_congested = NULL; + rb_link_node(&congested->rb_node, parent, node); + rb_insert_color(&congested->rb_node, &bdi->cgwb_congested_tree); + goto found; + } + + spin_unlock_irqrestore(&cgwb_lock, flags); + + /* allocate storage for new one and retry */ + new_congested = kzalloc(sizeof(*new_congested), gfp); + if (!new_congested) + return NULL; + + atomic_set(&new_congested->refcnt, 0); + new_congested->bdi = bdi; + new_congested->blkcg_id = blkcg_id; + goto retry; + +found: + atomic_inc(&congested->refcnt); + spin_unlock_irqrestore(&cgwb_lock, flags); + kfree(new_congested); + return congested; +} + +/** + * wb_congested_put - put a wb_congested + * @congested: wb_congested to put + * + * Put @congested and destroy it if the refcnt reaches zero. + */ +void wb_congested_put(struct bdi_writeback_congested *congested) +{ + unsigned long flags; + + local_irq_save(flags); + if (!atomic_dec_and_lock(&congested->refcnt, &cgwb_lock)) { + local_irq_restore(flags); + return; + } + + /* bdi might already have been destroyed leaving @congested unlinked */ + if (congested->bdi) { + rb_erase(&congested->rb_node, + &congested->bdi->cgwb_congested_tree); + congested->bdi = NULL; + } + + spin_unlock_irqrestore(&cgwb_lock, flags); + kfree(congested); +} + +static void cgwb_release_workfn(struct work_struct *work) +{ + struct bdi_writeback *wb = container_of(work, struct bdi_writeback, + release_work); + struct backing_dev_info *bdi = wb->bdi; + + wb_shutdown(wb); + + css_put(wb->memcg_css); + css_put(wb->blkcg_css); + + fprop_local_destroy_percpu(&wb->memcg_completions); + percpu_ref_exit(&wb->refcnt); + wb_exit(wb); + kfree_rcu(wb, rcu); + + if (atomic_dec_and_test(&bdi->usage_cnt)) + wake_up_all(&cgwb_release_wait); +} + +static void cgwb_release(struct percpu_ref *refcnt) +{ + struct bdi_writeback *wb = container_of(refcnt, struct bdi_writeback, + refcnt); + schedule_work(&wb->release_work); +} + +static void cgwb_kill(struct bdi_writeback *wb) +{ + lockdep_assert_held(&cgwb_lock); + + WARN_ON(!radix_tree_delete(&wb->bdi->cgwb_tree, wb->memcg_css->id)); + list_del(&wb->memcg_node); + list_del(&wb->blkcg_node); + percpu_ref_kill(&wb->refcnt); +} + +static int cgwb_create(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css, gfp_t gfp) +{ + struct mem_cgroup *memcg; + struct cgroup_subsys_state *blkcg_css; + struct blkcg *blkcg; + struct list_head *memcg_cgwb_list, *blkcg_cgwb_list; + struct bdi_writeback *wb; + unsigned long flags; + int ret = 0; + + memcg = mem_cgroup_from_css(memcg_css); + blkcg_css = cgroup_get_e_css(memcg_css->cgroup, &blkio_cgrp_subsys); + blkcg = css_to_blkcg(blkcg_css); + memcg_cgwb_list = mem_cgroup_cgwb_list(memcg); + blkcg_cgwb_list = &blkcg->cgwb_list; + + /* look up again under lock and discard on blkcg mismatch */ + spin_lock_irqsave(&cgwb_lock, flags); + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + if (wb && wb->blkcg_css != blkcg_css) { + cgwb_kill(wb); + wb = NULL; + } + spin_unlock_irqrestore(&cgwb_lock, flags); + if (wb) + goto out_put; + + /* need to create a new one */ + wb = kmalloc(sizeof(*wb), gfp); + if (!wb) + return -ENOMEM; + + ret = wb_init(wb, bdi, blkcg_css->id, gfp); + if (ret) + goto err_free; + + ret = percpu_ref_init(&wb->refcnt, cgwb_release, 0, gfp); + if (ret) + goto err_wb_exit; + + ret = fprop_local_init_percpu(&wb->memcg_completions, gfp); + if (ret) + goto err_ref_exit; + + wb->memcg_css = memcg_css; + wb->blkcg_css = blkcg_css; + INIT_WORK(&wb->release_work, cgwb_release_workfn); + set_bit(WB_registered, &wb->state); /* - * Drain work list and shutdown the delayed_work. At this point, - * @bdi->bdi_list is empty telling bdi_Writeback_workfn() that @bdi - * is dying and its work_list needs to be drained no matter what. + * The root wb determines the registered state of the whole bdi and + * memcg_cgwb_list and blkcg_cgwb_list's next pointers indicate + * whether they're still online. Don't link @wb if any is dead. + * See wb_memcg_offline() and wb_blkcg_offline(). */ - mod_delayed_work(bdi_wq, &bdi->wb.dwork, 0); - flush_delayed_work(&bdi->wb.dwork); + ret = -ENODEV; + spin_lock_irqsave(&cgwb_lock, flags); + if (test_bit(WB_registered, &bdi->wb.state) && + blkcg_cgwb_list->next && memcg_cgwb_list->next) { + /* we might have raced another instance of this function */ + ret = radix_tree_insert(&bdi->cgwb_tree, memcg_css->id, wb); + if (!ret) { + atomic_inc(&bdi->usage_cnt); + list_add(&wb->memcg_node, memcg_cgwb_list); + list_add(&wb->blkcg_node, blkcg_cgwb_list); + css_get(memcg_css); + css_get(blkcg_css); + } + } + spin_unlock_irqrestore(&cgwb_lock, flags); + if (ret) { + if (ret == -EEXIST) + ret = 0; + goto err_fprop_exit; + } + goto out_put; + +err_fprop_exit: + fprop_local_destroy_percpu(&wb->memcg_completions); +err_ref_exit: + percpu_ref_exit(&wb->refcnt); +err_wb_exit: + wb_exit(wb); +err_free: + kfree(wb); +out_put: + css_put(blkcg_css); + return ret; } -static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi) +/** + * wb_get_create - get wb for a given memcg, create if necessary + * @bdi: target bdi + * @memcg_css: cgroup_subsys_state of the target memcg (must have positive ref) + * @gfp: allocation mask to use + * + * Try to get the wb for @memcg_css on @bdi. If it doesn't exist, try to + * create one. The returned wb has its refcount incremented. + * + * This function uses css_get() on @memcg_css and thus expects its refcnt + * to be positive on invocation. IOW, rcu_read_lock() protection on + * @memcg_css isn't enough. try_get it before calling this function. + * + * A wb is keyed by its associated memcg. As blkcg implicitly enables + * memcg on the default hierarchy, memcg association is guaranteed to be + * more specific (equal or descendant to the associated blkcg) and thus can + * identify both the memcg and blkcg associations. + * + * Because the blkcg associated with a memcg may change as blkcg is enabled + * and disabled closer to root in the hierarchy, each wb keeps track of + * both the memcg and blkcg associated with it and verifies the blkcg on + * each lookup. On mismatch, the existing wb is discarded and a new one is + * created. + */ +struct bdi_writeback *wb_get_create(struct backing_dev_info *bdi, + struct cgroup_subsys_state *memcg_css, + gfp_t gfp) { - memset(wb, 0, sizeof(*wb)); + struct bdi_writeback *wb; + + might_sleep_if(gfp & __GFP_WAIT); + + if (!memcg_css->parent) + return &bdi->wb; + + do { + rcu_read_lock(); + wb = radix_tree_lookup(&bdi->cgwb_tree, memcg_css->id); + if (wb) { + struct cgroup_subsys_state *blkcg_css; + + /* see whether the blkcg association has changed */ + blkcg_css = cgroup_get_e_css(memcg_css->cgroup, + &blkio_cgrp_subsys); + if (unlikely(wb->blkcg_css != blkcg_css || + !wb_tryget(wb))) + wb = NULL; + css_put(blkcg_css); + } + rcu_read_unlock(); + } while (!wb && !cgwb_create(bdi, memcg_css, gfp)); + + return wb; +} - wb->bdi = bdi; - wb->last_old_flush = jiffies; - INIT_LIST_HEAD(&wb->b_dirty); - INIT_LIST_HEAD(&wb->b_io); - INIT_LIST_HEAD(&wb->b_more_io); - INIT_LIST_HEAD(&wb->b_dirty_time); - spin_lock_init(&wb->list_lock); - INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn); +static int cgwb_bdi_init(struct backing_dev_info *bdi) +{ + int ret; + + INIT_RADIX_TREE(&bdi->cgwb_tree, GFP_ATOMIC); + bdi->cgwb_congested_tree = RB_ROOT; + atomic_set(&bdi->usage_cnt, 1); + + ret = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL); + if (!ret) { + bdi->wb.memcg_css = mem_cgroup_root_css; + bdi->wb.blkcg_css = blkcg_root_css; + } + return ret; } -/* - * Initial write bandwidth: 100 MB/s +static void cgwb_bdi_destroy(struct backing_dev_info *bdi) +{ + struct radix_tree_iter iter; + struct bdi_writeback_congested *congested, *congested_n; + void **slot; + + WARN_ON(test_bit(WB_registered, &bdi->wb.state)); + + spin_lock_irq(&cgwb_lock); + + radix_tree_for_each_slot(slot, &bdi->cgwb_tree, &iter, 0) + cgwb_kill(*slot); + + rbtree_postorder_for_each_entry_safe(congested, congested_n, + &bdi->cgwb_congested_tree, rb_node) { + rb_erase(&congested->rb_node, &bdi->cgwb_congested_tree); + congested->bdi = NULL; /* mark @congested unlinked */ + } + + spin_unlock_irq(&cgwb_lock); + + /* + * All cgwb's and their congested states must be shutdown and + * released before returning. Drain the usage counter to wait for + * all cgwb's and cgwb_congested's ever created on @bdi. + */ + atomic_dec(&bdi->usage_cnt); + wait_event(cgwb_release_wait, !atomic_read(&bdi->usage_cnt)); +} + +/** + * wb_memcg_offline - kill all wb's associated with a memcg being offlined + * @memcg: memcg being offlined + * + * Also prevents creation of any new wb's associated with @memcg. */ -#define INIT_BW (100 << (20 - PAGE_SHIFT)) +void wb_memcg_offline(struct mem_cgroup *memcg) +{ + LIST_HEAD(to_destroy); + struct list_head *memcg_cgwb_list = mem_cgroup_cgwb_list(memcg); + struct bdi_writeback *wb, *next; + + spin_lock_irq(&cgwb_lock); + list_for_each_entry_safe(wb, next, memcg_cgwb_list, memcg_node) + cgwb_kill(wb); + memcg_cgwb_list->next = NULL; /* prevent new wb's */ + spin_unlock_irq(&cgwb_lock); +} -int bdi_init(struct backing_dev_info *bdi) +/** + * wb_blkcg_offline - kill all wb's associated with a blkcg being offlined + * @blkcg: blkcg being offlined + * + * Also prevents creation of any new wb's associated with @blkcg. + */ +void wb_blkcg_offline(struct blkcg *blkcg) { - int i, err; + LIST_HEAD(to_destroy); + struct bdi_writeback *wb, *next; + + spin_lock_irq(&cgwb_lock); + list_for_each_entry_safe(wb, next, &blkcg->cgwb_list, blkcg_node) + cgwb_kill(wb); + blkcg->cgwb_list.next = NULL; /* prevent new wb's */ + spin_unlock_irq(&cgwb_lock); +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static int cgwb_bdi_init(struct backing_dev_info *bdi) +{ + int err; + + bdi->wb_congested = kzalloc(sizeof(*bdi->wb_congested), GFP_KERNEL); + if (!bdi->wb_congested) + return -ENOMEM; + + err = wb_init(&bdi->wb, bdi, 1, GFP_KERNEL); + if (err) { + kfree(bdi->wb_congested); + return err; + } + return 0; +} + +static void cgwb_bdi_destroy(struct backing_dev_info *bdi) { } + +#endif /* CONFIG_CGROUP_WRITEBACK */ +int bdi_init(struct backing_dev_info *bdi) +{ bdi->dev = NULL; bdi->min_ratio = 0; bdi->max_ratio = 100; bdi->max_prop_frac = FPROP_FRAC_BASE; - spin_lock_init(&bdi->wb_lock); INIT_LIST_HEAD(&bdi->bdi_list); - INIT_LIST_HEAD(&bdi->work_list); + init_waitqueue_head(&bdi->wb_waitq); - bdi_wb_init(&bdi->wb, bdi); + return cgwb_bdi_init(bdi); +} +EXPORT_SYMBOL(bdi_init); - for (i = 0; i < NR_BDI_STAT_ITEMS; i++) { - err = percpu_counter_init(&bdi->bdi_stat[i], 0, GFP_KERNEL); - if (err) - goto err; - } +int bdi_register(struct backing_dev_info *bdi, struct device *parent, + const char *fmt, ...) +{ + va_list args; + struct device *dev; - bdi->dirty_exceeded = 0; + if (bdi->dev) /* The driver needs to use separate queues per device */ + return 0; - bdi->bw_time_stamp = jiffies; - bdi->written_stamp = 0; + va_start(args, fmt); + dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args); + va_end(args); + if (IS_ERR(dev)) + return PTR_ERR(dev); - bdi->balanced_dirty_ratelimit = INIT_BW; - bdi->dirty_ratelimit = INIT_BW; - bdi->write_bandwidth = INIT_BW; - bdi->avg_write_bandwidth = INIT_BW; + bdi->dev = dev; - err = fprop_local_init_percpu(&bdi->completions, GFP_KERNEL); + bdi_debug_register(bdi, dev_name(dev)); + set_bit(WB_registered, &bdi->wb.state); - if (err) { -err: - while (i--) - percpu_counter_destroy(&bdi->bdi_stat[i]); - } + spin_lock_bh(&bdi_lock); + list_add_tail_rcu(&bdi->bdi_list, &bdi_list); + spin_unlock_bh(&bdi_lock); - return err; + trace_writeback_bdi_register(bdi); + return 0; } -EXPORT_SYMBOL(bdi_init); +EXPORT_SYMBOL(bdi_register); -void bdi_destroy(struct backing_dev_info *bdi) +int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev) { - int i; + return bdi_register(bdi, NULL, "%u:%u", MAJOR(dev), MINOR(dev)); +} +EXPORT_SYMBOL(bdi_register_dev); + +/* + * Remove bdi from bdi_list, and ensure that it is no longer visible + */ +static void bdi_remove_from_list(struct backing_dev_info *bdi) +{ + spin_lock_bh(&bdi_lock); + list_del_rcu(&bdi->bdi_list); + spin_unlock_bh(&bdi_lock); + + synchronize_rcu_expedited(); +} + +/* + * Called when the device behind @bdi has been removed or ejected. + * + * We can't really do much here except for reducing the dirty ratio at + * the moment. In the future we should be able to set a flag so that + * the filesystem can handle errors at mark_inode_dirty time instead + * of only at writeback time. + */ +void bdi_unregister(struct backing_dev_info *bdi) +{ + if (WARN_ON_ONCE(!bdi->dev)) + return; - bdi_wb_shutdown(bdi); bdi_set_min_ratio(bdi, 0); +} +EXPORT_SYMBOL(bdi_unregister); - WARN_ON(!list_empty(&bdi->work_list)); - WARN_ON(delayed_work_pending(&bdi->wb.dwork)); +void bdi_destroy(struct backing_dev_info *bdi) +{ + /* make sure nobody finds us on the bdi_list anymore */ + bdi_remove_from_list(bdi); + wb_shutdown(&bdi->wb); + cgwb_bdi_destroy(bdi); if (bdi->dev) { bdi_debug_unregister(bdi); @@ -437,9 +853,7 @@ void bdi_destroy(struct backing_dev_info *bdi) bdi->dev = NULL; } - for (i = 0; i < NR_BDI_STAT_ITEMS; i++) - percpu_counter_destroy(&bdi->bdi_stat[i]); - fprop_local_destroy_percpu(&bdi->completions); + wb_exit(&bdi->wb); } EXPORT_SYMBOL(bdi_destroy); @@ -472,31 +886,31 @@ static wait_queue_head_t congestion_wqh[2] = { __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]), __WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1]) }; -static atomic_t nr_bdi_congested[2]; +static atomic_t nr_wb_congested[2]; -void clear_bdi_congested(struct backing_dev_info *bdi, int sync) +void clear_wb_congested(struct bdi_writeback_congested *congested, int sync) { - enum bdi_state bit; wait_queue_head_t *wqh = &congestion_wqh[sync]; + enum wb_state bit; - bit = sync ? BDI_sync_congested : BDI_async_congested; - if (test_and_clear_bit(bit, &bdi->state)) - atomic_dec(&nr_bdi_congested[sync]); + bit = sync ? WB_sync_congested : WB_async_congested; + if (test_and_clear_bit(bit, &congested->state)) + atomic_dec(&nr_wb_congested[sync]); smp_mb__after_atomic(); if (waitqueue_active(wqh)) wake_up(wqh); } -EXPORT_SYMBOL(clear_bdi_congested); +EXPORT_SYMBOL(clear_wb_congested); -void set_bdi_congested(struct backing_dev_info *bdi, int sync) +void set_wb_congested(struct bdi_writeback_congested *congested, int sync) { - enum bdi_state bit; + enum wb_state bit; - bit = sync ? BDI_sync_congested : BDI_async_congested; - if (!test_and_set_bit(bit, &bdi->state)) - atomic_inc(&nr_bdi_congested[sync]); + bit = sync ? WB_sync_congested : WB_async_congested; + if (!test_and_set_bit(bit, &congested->state)) + atomic_inc(&nr_wb_congested[sync]); } -EXPORT_SYMBOL(set_bdi_congested); +EXPORT_SYMBOL(set_wb_congested); /** * congestion_wait - wait for a backing_dev to become uncongested @@ -555,7 +969,7 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout) * encountered in the current zone, yield if necessary instead * of sleeping on the congestion queue */ - if (atomic_read(&nr_bdi_congested[sync]) == 0 || + if (atomic_read(&nr_wb_congested[sync]) == 0 || !test_bit(ZONE_CONGESTED, &zone->flags)) { cond_resched(); diff --git a/mm/fadvise.c b/mm/fadvise.c index 4a3907cf79f8..b8a5bc66b0c0 100644 --- a/mm/fadvise.c +++ b/mm/fadvise.c @@ -115,7 +115,7 @@ SYSCALL_DEFINE4(fadvise64_64, int, fd, loff_t, offset, loff_t, len, int, advice) case POSIX_FADV_NOREUSE: break; case POSIX_FADV_DONTNEED: - if (!bdi_write_congested(bdi)) + if (!inode_write_congested(mapping->host)) __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE); diff --git a/mm/filemap.c b/mm/filemap.c index 1ffef05f1c1f..803f46761fe8 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -100,6 +100,7 @@ * ->tree_lock (page_remove_rmap->set_page_dirty) * bdi.wb->list_lock (page_remove_rmap->set_page_dirty) * ->inode->i_lock (page_remove_rmap->set_page_dirty) + * ->memcg->move_lock (page_remove_rmap->mem_cgroup_begin_page_stat) * bdi.wb->list_lock (zap_pte_range->set_page_dirty) * ->inode->i_lock (zap_pte_range->set_page_dirty) * ->private_lock (zap_pte_range->__set_page_dirty_buffers) @@ -174,9 +175,11 @@ static void page_cache_tree_delete(struct address_space *mapping, /* * Delete a page from the page cache and free it. Caller has to make * sure the page is locked and that nobody else uses it - or that usage - * is safe. The caller must hold the mapping's tree_lock. + * is safe. The caller must hold the mapping's tree_lock and + * mem_cgroup_begin_page_stat(). */ -void __delete_from_page_cache(struct page *page, void *shadow) +void __delete_from_page_cache(struct page *page, void *shadow, + struct mem_cgroup *memcg) { struct address_space *mapping = page->mapping; @@ -210,7 +213,8 @@ void __delete_from_page_cache(struct page *page, void *shadow) * anyway will be cleared before returning page into buddy allocator. */ if (WARN_ON_ONCE(PageDirty(page))) - account_page_cleaned(page, mapping); + account_page_cleaned(page, mapping, memcg, + inode_to_wb(mapping->host)); } /** @@ -224,14 +228,20 @@ void __delete_from_page_cache(struct page *page, void *shadow) void delete_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; + struct mem_cgroup *memcg; + unsigned long flags; + void (*freepage)(struct page *); BUG_ON(!PageLocked(page)); freepage = mapping->a_ops->freepage; - spin_lock_irq(&mapping->tree_lock); - __delete_from_page_cache(page, NULL); - spin_unlock_irq(&mapping->tree_lock); + + memcg = mem_cgroup_begin_page_stat(page); + spin_lock_irqsave(&mapping->tree_lock, flags); + __delete_from_page_cache(page, NULL, memcg); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); if (freepage) freepage(page); @@ -281,7 +291,9 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, if (!mapping_cap_writeback_dirty(mapping)) return 0; + wbc_attach_fdatawrite_inode(&wbc, mapping->host); ret = do_writepages(mapping, &wbc); + wbc_detach_inode(&wbc); return ret; } @@ -470,6 +482,8 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) if (!error) { struct address_space *mapping = old->mapping; void (*freepage)(struct page *); + struct mem_cgroup *memcg; + unsigned long flags; pgoff_t offset = old->index; freepage = mapping->a_ops->freepage; @@ -478,15 +492,17 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask) new->mapping = mapping; new->index = offset; - spin_lock_irq(&mapping->tree_lock); - __delete_from_page_cache(old, NULL); + memcg = mem_cgroup_begin_page_stat(old); + spin_lock_irqsave(&mapping->tree_lock, flags); + __delete_from_page_cache(old, NULL, memcg); error = radix_tree_insert(&mapping->page_tree, offset, new); BUG_ON(error); mapping->nrpages++; __inc_zone_page_state(new, NR_FILE_PAGES); if (PageSwapBacked(new)) __inc_zone_page_state(new, NR_SHMEM); - spin_unlock_irq(&mapping->tree_lock); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); mem_cgroup_migrate(old, new, true); radix_tree_preload_end(); if (freepage) diff --git a/mm/madvise.c b/mm/madvise.c index d551475517bf..64bb8a22110c 100644 --- a/mm/madvise.c +++ b/mm/madvise.c @@ -17,6 +17,7 @@ #include <linux/fs.h> #include <linux/file.h> #include <linux/blkdev.h> +#include <linux/backing-dev.h> #include <linux/swap.h> #include <linux/swapops.h> diff --git a/mm/memcontrol.c b/mm/memcontrol.c index aac1c98a9bc7..bd05db663965 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -77,6 +77,7 @@ EXPORT_SYMBOL(memory_cgrp_subsys); #define MEM_CGROUP_RECLAIM_RETRIES 5 static struct mem_cgroup *root_mem_cgroup __read_mostly; +struct cgroup_subsys_state *mem_cgroup_root_css __read_mostly; /* Whether the swap controller is active */ #ifdef CONFIG_MEMCG_SWAP @@ -90,6 +91,7 @@ static const char * const mem_cgroup_stat_names[] = { "rss", "rss_huge", "mapped_file", + "dirty", "writeback", "swap", }; @@ -322,11 +324,6 @@ struct mem_cgroup { * percpu counter. */ struct mem_cgroup_stat_cpu __percpu *stat; - /* - * used when a cpu is offlined or other synchronizations - * See mem_cgroup_read_stat(). - */ - struct mem_cgroup_stat_cpu nocpu_base; spinlock_t pcp_counter_lock; #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_INET) @@ -346,6 +343,11 @@ struct mem_cgroup { atomic_t numainfo_updating; #endif +#ifdef CONFIG_CGROUP_WRITEBACK + struct list_head cgwb_list; + struct wb_domain cgwb_domain; +#endif + /* List of events which userspace want to receive */ struct list_head event_list; spinlock_t event_list_lock; @@ -596,6 +598,39 @@ struct cgroup_subsys_state *mem_cgroup_css(struct mem_cgroup *memcg) return &memcg->css; } +/** + * mem_cgroup_css_from_page - css of the memcg associated with a page + * @page: page of interest + * + * If memcg is bound to the default hierarchy, css of the memcg associated + * with @page is returned. The returned css remains associated with @page + * until it is released. + * + * If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup + * is returned. + * + * XXX: The above description of behavior on the default hierarchy isn't + * strictly true yet as replace_page_cache_page() can modify the + * association before @page is released even on the default hierarchy; + * however, the current and planned usages don't mix the the two functions + * and replace_page_cache_page() will soon be updated to make the invariant + * actually true. + */ +struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page) +{ + struct mem_cgroup *memcg; + + rcu_read_lock(); + + memcg = page->mem_cgroup; + + if (!memcg || !cgroup_on_dfl(memcg->css.cgroup)) + memcg = root_mem_cgroup; + + rcu_read_unlock(); + return &memcg->css; +} + static struct mem_cgroup_per_zone * mem_cgroup_page_zoneinfo(struct mem_cgroup *memcg, struct page *page) { @@ -795,15 +830,8 @@ static long mem_cgroup_read_stat(struct mem_cgroup *memcg, long val = 0; int cpu; - get_online_cpus(); - for_each_online_cpu(cpu) + for_each_possible_cpu(cpu) val += per_cpu(memcg->stat->count[idx], cpu); -#ifdef CONFIG_HOTPLUG_CPU - spin_lock(&memcg->pcp_counter_lock); - val += memcg->nocpu_base.count[idx]; - spin_unlock(&memcg->pcp_counter_lock); -#endif - put_online_cpus(); return val; } @@ -813,15 +841,8 @@ static unsigned long mem_cgroup_read_events(struct mem_cgroup *memcg, unsigned long val = 0; int cpu; - get_online_cpus(); - for_each_online_cpu(cpu) + for_each_possible_cpu(cpu) val += per_cpu(memcg->stat->events[idx], cpu); -#ifdef CONFIG_HOTPLUG_CPU - spin_lock(&memcg->pcp_counter_lock); - val += memcg->nocpu_base.events[idx]; - spin_unlock(&memcg->pcp_counter_lock); -#endif - put_online_cpus(); return val; } @@ -2011,6 +2032,7 @@ again: return memcg; } +EXPORT_SYMBOL(mem_cgroup_begin_page_stat); /** * mem_cgroup_end_page_stat - finish a page state statistics transaction @@ -2029,6 +2051,7 @@ void mem_cgroup_end_page_stat(struct mem_cgroup *memcg) rcu_read_unlock(); } +EXPORT_SYMBOL(mem_cgroup_end_page_stat); /** * mem_cgroup_update_page_stat - update page state statistics @@ -2169,37 +2192,12 @@ static void drain_all_stock(struct mem_cgroup *root_memcg) mutex_unlock(&percpu_charge_mutex); } -/* - * This function drains percpu counter value from DEAD cpu and - * move it to local cpu. Note that this function can be preempted. - */ -static void mem_cgroup_drain_pcp_counter(struct mem_cgroup *memcg, int cpu) -{ - int i; - - spin_lock(&memcg->pcp_counter_lock); - for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) { - long x = per_cpu(memcg->stat->count[i], cpu); - - per_cpu(memcg->stat->count[i], cpu) = 0; - memcg->nocpu_base.count[i] += x; - } - for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++) { - unsigned long x = per_cpu(memcg->stat->events[i], cpu); - - per_cpu(memcg->stat->events[i], cpu) = 0; - memcg->nocpu_base.events[i] += x; - } - spin_unlock(&memcg->pcp_counter_lock); -} - static int memcg_cpu_hotplug_callback(struct notifier_block *nb, unsigned long action, void *hcpu) { int cpu = (unsigned long)hcpu; struct memcg_stock_pcp *stock; - struct mem_cgroup *iter; if (action == CPU_ONLINE) return NOTIFY_OK; @@ -2207,9 +2205,6 @@ static int memcg_cpu_hotplug_callback(struct notifier_block *nb, if (action != CPU_DEAD && action != CPU_DEAD_FROZEN) return NOTIFY_OK; - for_each_mem_cgroup(iter) - mem_cgroup_drain_pcp_counter(iter, cpu); - stock = &per_cpu(memcg_stock, cpu); drain_stock(stock); return NOTIFY_OK; @@ -3997,6 +3992,98 @@ static void memcg_destroy_kmem(struct mem_cgroup *memcg) } #endif +#ifdef CONFIG_CGROUP_WRITEBACK + +struct list_head *mem_cgroup_cgwb_list(struct mem_cgroup *memcg) +{ + return &memcg->cgwb_list; +} + +static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) +{ + return wb_domain_init(&memcg->cgwb_domain, gfp); +} + +static void memcg_wb_domain_exit(struct mem_cgroup *memcg) +{ + wb_domain_exit(&memcg->cgwb_domain); +} + +static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) +{ + wb_domain_size_changed(&memcg->cgwb_domain); +} + +struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); + + if (!memcg->css.parent) + return NULL; + + return &memcg->cgwb_domain; +} + +/** + * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg + * @wb: bdi_writeback in question + * @pavail: out parameter for number of available pages + * @pdirty: out parameter for number of dirty pages + * @pwriteback: out parameter for number of pages under writeback + * + * Determine the numbers of available, dirty, and writeback pages in @wb's + * memcg. Dirty and writeback are self-explanatory. Available is a bit + * more involved. + * + * A memcg's headroom is "min(max, high) - used". The available memory is + * calculated as the lowest headroom of itself and the ancestors plus the + * number of pages already being used for file pages. Note that this + * doesn't consider the actual amount of available memory in the system. + * The caller should further cap *@pavail accordingly. + */ +void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pavail, + unsigned long *pdirty, unsigned long *pwriteback) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); + struct mem_cgroup *parent; + unsigned long head_room = PAGE_COUNTER_MAX; + unsigned long file_pages; + + *pdirty = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_DIRTY); + + /* this should eventually include NR_UNSTABLE_NFS */ + *pwriteback = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_WRITEBACK); + + file_pages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) | + (1 << LRU_ACTIVE_FILE)); + while ((parent = parent_mem_cgroup(memcg))) { + unsigned long ceiling = min(memcg->memory.limit, memcg->high); + unsigned long used = page_counter_read(&memcg->memory); + + head_room = min(head_room, ceiling - min(ceiling, used)); + memcg = parent; + } + + *pavail = file_pages + head_room; +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +static int memcg_wb_domain_init(struct mem_cgroup *memcg, gfp_t gfp) +{ + return 0; +} + +static void memcg_wb_domain_exit(struct mem_cgroup *memcg) +{ +} + +static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg) +{ +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + /* * DO NOT USE IN NEW FILES. * @@ -4381,9 +4468,15 @@ static struct mem_cgroup *mem_cgroup_alloc(void) memcg->stat = alloc_percpu(struct mem_cgroup_stat_cpu); if (!memcg->stat) goto out_free; + + if (memcg_wb_domain_init(memcg, GFP_KERNEL)) + goto out_free_stat; + spin_lock_init(&memcg->pcp_counter_lock); return memcg; +out_free_stat: + free_percpu(memcg->stat); out_free: kfree(memcg); return NULL; @@ -4410,6 +4503,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg) free_mem_cgroup_per_zone_info(memcg, node); free_percpu(memcg->stat); + memcg_wb_domain_exit(memcg); kfree(memcg); } @@ -4442,6 +4536,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) /* root ? */ if (parent_css == NULL) { root_mem_cgroup = memcg; + mem_cgroup_root_css = &memcg->css; page_counter_init(&memcg->memory, NULL); memcg->high = PAGE_COUNTER_MAX; memcg->soft_limit = PAGE_COUNTER_MAX; @@ -4460,7 +4555,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) #ifdef CONFIG_MEMCG_KMEM memcg->kmemcg_id = -1; #endif - +#ifdef CONFIG_CGROUP_WRITEBACK + INIT_LIST_HEAD(&memcg->cgwb_list); +#endif return &memcg->css; free_out: @@ -4548,6 +4645,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) vmpressure_cleanup(&memcg->vmpressure); memcg_deactivate_kmem(memcg); + + wb_memcg_offline(memcg); } static void mem_cgroup_css_free(struct cgroup_subsys_state *css) @@ -4581,6 +4680,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css) memcg->low = 0; memcg->high = PAGE_COUNTER_MAX; memcg->soft_limit = PAGE_COUNTER_MAX; + memcg_wb_domain_size_changed(memcg); } #ifdef CONFIG_MMU @@ -4750,6 +4850,7 @@ static int mem_cgroup_move_account(struct page *page, { unsigned long flags; int ret; + bool anon; VM_BUG_ON(from == to); VM_BUG_ON_PAGE(PageLRU(page), page); @@ -4775,15 +4876,33 @@ static int mem_cgroup_move_account(struct page *page, if (page->mem_cgroup != from) goto out_unlock; + anon = PageAnon(page); + spin_lock_irqsave(&from->move_lock, flags); - if (!PageAnon(page) && page_mapped(page)) { + if (!anon && page_mapped(page)) { __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], nr_pages); __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED], nr_pages); } + /* + * move_lock grabbed above and caller set from->moving_account, so + * mem_cgroup_update_page_stat() will serialize updates to PageDirty. + * So mapping should be stable for dirty pages. + */ + if (!anon && PageDirty(page)) { + struct address_space *mapping = page_mapping(page); + + if (mapping_cap_account_dirty(mapping)) { + __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_DIRTY], + nr_pages); + __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY], + nr_pages); + } + } + if (PageWriteback(page)) { __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK], nr_pages); @@ -5299,6 +5418,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of, memcg->high = high; + memcg_wb_domain_size_changed(memcg); return nbytes; } @@ -5331,6 +5451,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, if (err) return err; + memcg_wb_domain_size_changed(memcg); return nbytes; } diff --git a/mm/page-writeback.c b/mm/page-writeback.c index eb59f7eea508..33c2bf721e97 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -122,31 +122,31 @@ EXPORT_SYMBOL(laptop_mode); /* End of sysctl-exported parameters */ -unsigned long global_dirty_limit; +struct wb_domain global_wb_domain; -/* - * Scale the writeback cache size proportional to the relative writeout speeds. - * - * We do this by keeping a floating proportion between BDIs, based on page - * writeback completions [end_page_writeback()]. Those devices that write out - * pages fastest will get the larger share, while the slower will get a smaller - * share. - * - * We use page writeout completions because we are interested in getting rid of - * dirty pages. Having them written out is the primary goal. - * - * We introduce a concept of time, a period over which we measure these events, - * because demand can/will vary over time. The length of this period itself is - * measured in page writeback completions. - * - */ -static struct fprop_global writeout_completions; +/* consolidated parameters for balance_dirty_pages() and its subroutines */ +struct dirty_throttle_control { +#ifdef CONFIG_CGROUP_WRITEBACK + struct wb_domain *dom; + struct dirty_throttle_control *gdtc; /* only set in memcg dtc's */ +#endif + struct bdi_writeback *wb; + struct fprop_local_percpu *wb_completions; -static void writeout_period(unsigned long t); -/* Timer for aging of writeout_completions */ -static struct timer_list writeout_period_timer = - TIMER_DEFERRED_INITIALIZER(writeout_period, 0, 0); -static unsigned long writeout_period_time = 0; + unsigned long avail; /* dirtyable */ + unsigned long dirty; /* file_dirty + write + nfs */ + unsigned long thresh; /* dirty threshold */ + unsigned long bg_thresh; /* dirty background threshold */ + + unsigned long wb_dirty; /* per-wb counterparts */ + unsigned long wb_thresh; + unsigned long wb_bg_thresh; + + unsigned long pos_ratio; +}; + +#define DTC_INIT_COMMON(__wb) .wb = (__wb), \ + .wb_completions = &(__wb)->completions /* * Length of period for aging writeout fractions of bdis. This is an @@ -155,6 +155,97 @@ static unsigned long writeout_period_time = 0; */ #define VM_COMPLETIONS_PERIOD_LEN (3*HZ) +#ifdef CONFIG_CGROUP_WRITEBACK + +#define GDTC_INIT(__wb) .dom = &global_wb_domain, \ + DTC_INIT_COMMON(__wb) +#define GDTC_INIT_NO_WB .dom = &global_wb_domain +#define MDTC_INIT(__wb, __gdtc) .dom = mem_cgroup_wb_domain(__wb), \ + .gdtc = __gdtc, \ + DTC_INIT_COMMON(__wb) + +static bool mdtc_valid(struct dirty_throttle_control *dtc) +{ + return dtc->dom; +} + +static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) +{ + return dtc->dom; +} + +static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc) +{ + return mdtc->gdtc; +} + +static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb) +{ + return &wb->memcg_completions; +} + +static void wb_min_max_ratio(struct bdi_writeback *wb, + unsigned long *minp, unsigned long *maxp) +{ + unsigned long this_bw = wb->avg_write_bandwidth; + unsigned long tot_bw = atomic_long_read(&wb->bdi->tot_write_bandwidth); + unsigned long long min = wb->bdi->min_ratio; + unsigned long long max = wb->bdi->max_ratio; + + /* + * @wb may already be clean by the time control reaches here and + * the total may not include its bw. + */ + if (this_bw < tot_bw) { + if (min) { + min *= this_bw; + do_div(min, tot_bw); + } + if (max < 100) { + max *= this_bw; + do_div(max, tot_bw); + } + } + + *minp = min; + *maxp = max; +} + +#else /* CONFIG_CGROUP_WRITEBACK */ + +#define GDTC_INIT(__wb) DTC_INIT_COMMON(__wb) +#define GDTC_INIT_NO_WB +#define MDTC_INIT(__wb, __gdtc) + +static bool mdtc_valid(struct dirty_throttle_control *dtc) +{ + return false; +} + +static struct wb_domain *dtc_dom(struct dirty_throttle_control *dtc) +{ + return &global_wb_domain; +} + +static struct dirty_throttle_control *mdtc_gdtc(struct dirty_throttle_control *mdtc) +{ + return NULL; +} + +static struct fprop_local_percpu *wb_memcg_completions(struct bdi_writeback *wb) +{ + return NULL; +} + +static void wb_min_max_ratio(struct bdi_writeback *wb, + unsigned long *minp, unsigned long *maxp) +{ + *minp = wb->bdi->min_ratio; + *maxp = wb->bdi->max_ratio; +} + +#endif /* CONFIG_CGROUP_WRITEBACK */ + /* * In a memory zone, there is a certain amount of pages we consider * available for the page cache, which is essentially the number of @@ -250,42 +341,88 @@ static unsigned long global_dirtyable_memory(void) return x + 1; /* Ensure that we never return 0 */ } -/* - * global_dirty_limits - background-writeback and dirty-throttling thresholds +/** + * domain_dirty_limits - calculate thresh and bg_thresh for a wb_domain + * @dtc: dirty_throttle_control of interest * - * Calculate the dirty thresholds based on sysctl parameters - * - vm.dirty_background_ratio or vm.dirty_background_bytes - * - vm.dirty_ratio or vm.dirty_bytes - * The dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and + * Calculate @dtc->thresh and ->bg_thresh considering + * vm_dirty_{bytes|ratio} and dirty_background_{bytes|ratio}. The caller + * must ensure that @dtc->avail is set before calling this function. The + * dirty limits will be lifted by 1/4 for PF_LESS_THROTTLE (ie. nfsd) and * real-time tasks. */ -void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) +static void domain_dirty_limits(struct dirty_throttle_control *dtc) { - const unsigned long available_memory = global_dirtyable_memory(); - unsigned long background; - unsigned long dirty; + const unsigned long available_memory = dtc->avail; + struct dirty_throttle_control *gdtc = mdtc_gdtc(dtc); + unsigned long bytes = vm_dirty_bytes; + unsigned long bg_bytes = dirty_background_bytes; + unsigned long ratio = vm_dirty_ratio; + unsigned long bg_ratio = dirty_background_ratio; + unsigned long thresh; + unsigned long bg_thresh; struct task_struct *tsk; - if (vm_dirty_bytes) - dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE); + /* gdtc is !NULL iff @dtc is for memcg domain */ + if (gdtc) { + unsigned long global_avail = gdtc->avail; + + /* + * The byte settings can't be applied directly to memcg + * domains. Convert them to ratios by scaling against + * globally available memory. + */ + if (bytes) + ratio = min(DIV_ROUND_UP(bytes, PAGE_SIZE) * 100 / + global_avail, 100UL); + if (bg_bytes) + bg_ratio = min(DIV_ROUND_UP(bg_bytes, PAGE_SIZE) * 100 / + global_avail, 100UL); + bytes = bg_bytes = 0; + } + + if (bytes) + thresh = DIV_ROUND_UP(bytes, PAGE_SIZE); else - dirty = (vm_dirty_ratio * available_memory) / 100; + thresh = (ratio * available_memory) / 100; - if (dirty_background_bytes) - background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE); + if (bg_bytes) + bg_thresh = DIV_ROUND_UP(bg_bytes, PAGE_SIZE); else - background = (dirty_background_ratio * available_memory) / 100; + bg_thresh = (bg_ratio * available_memory) / 100; - if (background >= dirty) - background = dirty / 2; + if (bg_thresh >= thresh) + bg_thresh = thresh / 2; tsk = current; if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { - background += background / 4; - dirty += dirty / 4; + bg_thresh += bg_thresh / 4; + thresh += thresh / 4; } - *pbackground = background; - *pdirty = dirty; - trace_global_dirty_state(background, dirty); + dtc->thresh = thresh; + dtc->bg_thresh = bg_thresh; + + /* we should eventually report the domain in the TP */ + if (!gdtc) + trace_global_dirty_state(bg_thresh, thresh); +} + +/** + * global_dirty_limits - background-writeback and dirty-throttling thresholds + * @pbackground: out parameter for bg_thresh + * @pdirty: out parameter for thresh + * + * Calculate bg_thresh and thresh for global_wb_domain. See + * domain_dirty_limits() for details. + */ +void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty) +{ + struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB }; + + gdtc.avail = global_dirtyable_memory(); + domain_dirty_limits(&gdtc); + + *pbackground = gdtc.bg_thresh; + *pdirty = gdtc.thresh; } /** @@ -392,47 +529,52 @@ static unsigned long wp_next_time(unsigned long cur_time) return cur_time; } -/* - * Increment the BDI's writeout completion count and the global writeout - * completion count. Called from test_clear_page_writeback(). - */ -static inline void __bdi_writeout_inc(struct backing_dev_info *bdi) +static void wb_domain_writeout_inc(struct wb_domain *dom, + struct fprop_local_percpu *completions, + unsigned int max_prop_frac) { - __inc_bdi_stat(bdi, BDI_WRITTEN); - __fprop_inc_percpu_max(&writeout_completions, &bdi->completions, - bdi->max_prop_frac); + __fprop_inc_percpu_max(&dom->completions, completions, + max_prop_frac); /* First event after period switching was turned off? */ - if (!unlikely(writeout_period_time)) { + if (!unlikely(dom->period_time)) { /* * We can race with other __bdi_writeout_inc calls here but * it does not cause any harm since the resulting time when * timer will fire and what is in writeout_period_time will be * roughly the same. */ - writeout_period_time = wp_next_time(jiffies); - mod_timer(&writeout_period_timer, writeout_period_time); + dom->period_time = wp_next_time(jiffies); + mod_timer(&dom->period_timer, dom->period_time); } } -void bdi_writeout_inc(struct backing_dev_info *bdi) +/* + * Increment @wb's writeout completion count and the global writeout + * completion count. Called from test_clear_page_writeback(). + */ +static inline void __wb_writeout_inc(struct bdi_writeback *wb) { - unsigned long flags; + struct wb_domain *cgdom; - local_irq_save(flags); - __bdi_writeout_inc(bdi); - local_irq_restore(flags); + __inc_wb_stat(wb, WB_WRITTEN); + wb_domain_writeout_inc(&global_wb_domain, &wb->completions, + wb->bdi->max_prop_frac); + + cgdom = mem_cgroup_wb_domain(wb); + if (cgdom) + wb_domain_writeout_inc(cgdom, wb_memcg_completions(wb), + wb->bdi->max_prop_frac); } -EXPORT_SYMBOL_GPL(bdi_writeout_inc); -/* - * Obtain an accurate fraction of the BDI's portion. - */ -static void bdi_writeout_fraction(struct backing_dev_info *bdi, - long *numerator, long *denominator) +void wb_writeout_inc(struct bdi_writeback *wb) { - fprop_fraction_percpu(&writeout_completions, &bdi->completions, - numerator, denominator); + unsigned long flags; + + local_irq_save(flags); + __wb_writeout_inc(wb); + local_irq_restore(flags); } +EXPORT_SYMBOL_GPL(wb_writeout_inc); /* * On idle system, we can be called long after we scheduled because we use @@ -440,22 +582,46 @@ static void bdi_writeout_fraction(struct backing_dev_info *bdi, */ static void writeout_period(unsigned long t) { - int miss_periods = (jiffies - writeout_period_time) / + struct wb_domain *dom = (void *)t; + int miss_periods = (jiffies - dom->period_time) / VM_COMPLETIONS_PERIOD_LEN; - if (fprop_new_period(&writeout_completions, miss_periods + 1)) { - writeout_period_time = wp_next_time(writeout_period_time + + if (fprop_new_period(&dom->completions, miss_periods + 1)) { + dom->period_time = wp_next_time(dom->period_time + miss_periods * VM_COMPLETIONS_PERIOD_LEN); - mod_timer(&writeout_period_timer, writeout_period_time); + mod_timer(&dom->period_timer, dom->period_time); } else { /* * Aging has zeroed all fractions. Stop wasting CPU on period * updates. */ - writeout_period_time = 0; + dom->period_time = 0; } } +int wb_domain_init(struct wb_domain *dom, gfp_t gfp) +{ + memset(dom, 0, sizeof(*dom)); + + spin_lock_init(&dom->lock); + + init_timer_deferrable(&dom->period_timer); + dom->period_timer.function = writeout_period; + dom->period_timer.data = (unsigned long)dom; + + dom->dirty_limit_tstamp = jiffies; + + return fprop_global_init(&dom->completions, gfp); +} + +#ifdef CONFIG_CGROUP_WRITEBACK +void wb_domain_exit(struct wb_domain *dom) +{ + del_timer_sync(&dom->period_timer); + fprop_global_destroy(&dom->completions); +} +#endif + /* * bdi_min_ratio keeps the sum of the minimum dirty shares of all * registered backing devices, which, for obvious reasons, can not @@ -510,17 +676,26 @@ static unsigned long dirty_freerun_ceiling(unsigned long thresh, return (thresh + bg_thresh) / 2; } -static unsigned long hard_dirty_limit(unsigned long thresh) +static unsigned long hard_dirty_limit(struct wb_domain *dom, + unsigned long thresh) { - return max(thresh, global_dirty_limit); + return max(thresh, dom->dirty_limit); +} + +/* memory available to a memcg domain is capped by system-wide clean memory */ +static void mdtc_cap_avail(struct dirty_throttle_control *mdtc) +{ + struct dirty_throttle_control *gdtc = mdtc_gdtc(mdtc); + unsigned long clean = gdtc->avail - min(gdtc->avail, gdtc->dirty); + + mdtc->avail = min(mdtc->avail, clean); } /** - * bdi_dirty_limit - @bdi's share of dirty throttling threshold - * @bdi: the backing_dev_info to query - * @dirty: global dirty limit in pages + * __wb_calc_thresh - @wb's share of dirty throttling threshold + * @dtc: dirty_throttle_context of interest * - * Returns @bdi's dirty limit in pages. The term "dirty" in the context of + * Returns @wb's dirty limit in pages. The term "dirty" in the context of * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages. * * Note that balance_dirty_pages() will only seriously take it as a hard limit @@ -528,34 +703,47 @@ static unsigned long hard_dirty_limit(unsigned long thresh) * control. For example, when the device is completely stalled due to some error * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key. * In the other normal situations, it acts more gently by throttling the tasks - * more (rather than completely block them) when the bdi dirty pages go high. + * more (rather than completely block them) when the wb dirty pages go high. * * It allocates high/low dirty limits to fast/slow devices, in order to prevent * - starving fast devices * - piling up dirty pages (that will take long time to sync) on slow devices * - * The bdi's share of dirty limit will be adapting to its throughput and + * The wb's share of dirty limit will be adapting to its throughput and * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set. */ -unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty) +static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc) { - u64 bdi_dirty; + struct wb_domain *dom = dtc_dom(dtc); + unsigned long thresh = dtc->thresh; + u64 wb_thresh; long numerator, denominator; + unsigned long wb_min_ratio, wb_max_ratio; /* - * Calculate this BDI's share of the dirty ratio. + * Calculate this BDI's share of the thresh ratio. */ - bdi_writeout_fraction(bdi, &numerator, &denominator); + fprop_fraction_percpu(&dom->completions, dtc->wb_completions, + &numerator, &denominator); + + wb_thresh = (thresh * (100 - bdi_min_ratio)) / 100; + wb_thresh *= numerator; + do_div(wb_thresh, denominator); - bdi_dirty = (dirty * (100 - bdi_min_ratio)) / 100; - bdi_dirty *= numerator; - do_div(bdi_dirty, denominator); + wb_min_max_ratio(dtc->wb, &wb_min_ratio, &wb_max_ratio); - bdi_dirty += (dirty * bdi->min_ratio) / 100; - if (bdi_dirty > (dirty * bdi->max_ratio) / 100) - bdi_dirty = dirty * bdi->max_ratio / 100; + wb_thresh += (thresh * wb_min_ratio) / 100; + if (wb_thresh > (thresh * wb_max_ratio) / 100) + wb_thresh = thresh * wb_max_ratio / 100; - return bdi_dirty; + return wb_thresh; +} + +unsigned long wb_calc_thresh(struct bdi_writeback *wb, unsigned long thresh) +{ + struct dirty_throttle_control gdtc = { GDTC_INIT(wb), + .thresh = thresh }; + return __wb_calc_thresh(&gdtc); } /* @@ -594,7 +782,7 @@ static long long pos_ratio_polynom(unsigned long setpoint, * * (o) global/bdi setpoints * - * We want the dirty pages be balanced around the global/bdi setpoints. + * We want the dirty pages be balanced around the global/wb setpoints. * When the number of dirty pages is higher/lower than the setpoint, the * dirty position control ratio (and hence task dirty ratelimit) will be * decreased/increased to bring the dirty pages back to the setpoint. @@ -604,8 +792,8 @@ static long long pos_ratio_polynom(unsigned long setpoint, * if (dirty < setpoint) scale up pos_ratio * if (dirty > setpoint) scale down pos_ratio * - * if (bdi_dirty < bdi_setpoint) scale up pos_ratio - * if (bdi_dirty > bdi_setpoint) scale down pos_ratio + * if (wb_dirty < wb_setpoint) scale up pos_ratio + * if (wb_dirty > wb_setpoint) scale down pos_ratio * * task_ratelimit = dirty_ratelimit * pos_ratio >> RATELIMIT_CALC_SHIFT * @@ -630,7 +818,7 @@ static long long pos_ratio_polynom(unsigned long setpoint, * 0 +------------.------------------.----------------------*-------------> * freerun^ setpoint^ limit^ dirty pages * - * (o) bdi control line + * (o) wb control line * * ^ pos_ratio * | @@ -656,33 +844,32 @@ static long long pos_ratio_polynom(unsigned long setpoint, * | . . * | . . * 0 +----------------------.-------------------------------.-------------> - * bdi_setpoint^ x_intercept^ + * wb_setpoint^ x_intercept^ * - * The bdi control line won't drop below pos_ratio=1/4, so that bdi_dirty can + * The wb control line won't drop below pos_ratio=1/4, so that wb_dirty can * be smoothly throttled down to normal if it starts high in situations like * - start writing to a slow SD card and a fast disk at the same time. The SD - * card's bdi_dirty may rush to many times higher than bdi_setpoint. - * - the bdi dirty thresh drops quickly due to change of JBOD workload + * card's wb_dirty may rush to many times higher than wb_setpoint. + * - the wb dirty thresh drops quickly due to change of JBOD workload */ -static unsigned long bdi_position_ratio(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty) -{ - unsigned long write_bw = bdi->avg_write_bandwidth; - unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); - unsigned long limit = hard_dirty_limit(thresh); +static void wb_position_ratio(struct dirty_throttle_control *dtc) +{ + struct bdi_writeback *wb = dtc->wb; + unsigned long write_bw = wb->avg_write_bandwidth; + unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh); + unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh); + unsigned long wb_thresh = dtc->wb_thresh; unsigned long x_intercept; unsigned long setpoint; /* dirty pages' target balance point */ - unsigned long bdi_setpoint; + unsigned long wb_setpoint; unsigned long span; long long pos_ratio; /* for scaling up/down the rate limit */ long x; - if (unlikely(dirty >= limit)) - return 0; + dtc->pos_ratio = 0; + + if (unlikely(dtc->dirty >= limit)) + return; /* * global setpoint @@ -690,165 +877,167 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi, * See comment for pos_ratio_polynom(). */ setpoint = (freerun + limit) / 2; - pos_ratio = pos_ratio_polynom(setpoint, dirty, limit); + pos_ratio = pos_ratio_polynom(setpoint, dtc->dirty, limit); /* * The strictlimit feature is a tool preventing mistrusted filesystems * from growing a large number of dirty pages before throttling. For - * such filesystems balance_dirty_pages always checks bdi counters - * against bdi limits. Even if global "nr_dirty" is under "freerun". + * such filesystems balance_dirty_pages always checks wb counters + * against wb limits. Even if global "nr_dirty" is under "freerun". * This is especially important for fuse which sets bdi->max_ratio to * 1% by default. Without strictlimit feature, fuse writeback may * consume arbitrary amount of RAM because it is accounted in * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty". * - * Here, in bdi_position_ratio(), we calculate pos_ratio based on - * two values: bdi_dirty and bdi_thresh. Let's consider an example: + * Here, in wb_position_ratio(), we calculate pos_ratio based on + * two values: wb_dirty and wb_thresh. Let's consider an example: * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global * limits are set by default to 10% and 20% (background and throttle). - * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages. - * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is - * about ~6K pages (as the average of background and throttle bdi + * Then wb_thresh is 1% of 20% of 16GB. This amounts to ~8K pages. + * wb_calc_thresh(wb, bg_thresh) is about ~4K pages. wb_setpoint is + * about ~6K pages (as the average of background and throttle wb * limits). The 3rd order polynomial will provide positive feedback if - * bdi_dirty is under bdi_setpoint and vice versa. + * wb_dirty is under wb_setpoint and vice versa. * * Note, that we cannot use global counters in these calculations - * because we want to throttle process writing to a strictlimit BDI + * because we want to throttle process writing to a strictlimit wb * much earlier than global "freerun" is reached (~23MB vs. ~2.3GB * in the example above). */ - if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) { - long long bdi_pos_ratio; - unsigned long bdi_bg_thresh; + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { + long long wb_pos_ratio; - if (bdi_dirty < 8) - return min_t(long long, pos_ratio * 2, - 2 << RATELIMIT_CALC_SHIFT); + if (dtc->wb_dirty < 8) { + dtc->pos_ratio = min_t(long long, pos_ratio * 2, + 2 << RATELIMIT_CALC_SHIFT); + return; + } - if (bdi_dirty >= bdi_thresh) - return 0; + if (dtc->wb_dirty >= wb_thresh) + return; - bdi_bg_thresh = div_u64((u64)bdi_thresh * bg_thresh, thresh); - bdi_setpoint = dirty_freerun_ceiling(bdi_thresh, - bdi_bg_thresh); + wb_setpoint = dirty_freerun_ceiling(wb_thresh, + dtc->wb_bg_thresh); - if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh) - return 0; + if (wb_setpoint == 0 || wb_setpoint == wb_thresh) + return; - bdi_pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty, - bdi_thresh); + wb_pos_ratio = pos_ratio_polynom(wb_setpoint, dtc->wb_dirty, + wb_thresh); /* - * Typically, for strictlimit case, bdi_setpoint << setpoint - * and pos_ratio >> bdi_pos_ratio. In the other words global + * Typically, for strictlimit case, wb_setpoint << setpoint + * and pos_ratio >> wb_pos_ratio. In the other words global * state ("dirty") is not limiting factor and we have to - * make decision based on bdi counters. But there is an + * make decision based on wb counters. But there is an * important case when global pos_ratio should get precedence: * global limits are exceeded (e.g. due to activities on other - * BDIs) while given strictlimit BDI is below limit. + * wb's) while given strictlimit wb is below limit. * - * "pos_ratio * bdi_pos_ratio" would work for the case above, + * "pos_ratio * wb_pos_ratio" would work for the case above, * but it would look too non-natural for the case of all - * activity in the system coming from a single strictlimit BDI + * activity in the system coming from a single strictlimit wb * with bdi->max_ratio == 100%. * * Note that min() below somewhat changes the dynamics of the * control system. Normally, pos_ratio value can be well over 3 - * (when globally we are at freerun and bdi is well below bdi + * (when globally we are at freerun and wb is well below wb * setpoint). Now the maximum pos_ratio in the same situation * is 2. We might want to tweak this if we observe the control * system is too slow to adapt. */ - return min(pos_ratio, bdi_pos_ratio); + dtc->pos_ratio = min(pos_ratio, wb_pos_ratio); + return; } /* * We have computed basic pos_ratio above based on global situation. If - * the bdi is over/under its share of dirty pages, we want to scale + * the wb is over/under its share of dirty pages, we want to scale * pos_ratio further down/up. That is done by the following mechanism. */ /* - * bdi setpoint + * wb setpoint * - * f(bdi_dirty) := 1.0 + k * (bdi_dirty - bdi_setpoint) + * f(wb_dirty) := 1.0 + k * (wb_dirty - wb_setpoint) * - * x_intercept - bdi_dirty + * x_intercept - wb_dirty * := -------------------------- - * x_intercept - bdi_setpoint + * x_intercept - wb_setpoint * - * The main bdi control line is a linear function that subjects to + * The main wb control line is a linear function that subjects to * - * (1) f(bdi_setpoint) = 1.0 - * (2) k = - 1 / (8 * write_bw) (in single bdi case) - * or equally: x_intercept = bdi_setpoint + 8 * write_bw + * (1) f(wb_setpoint) = 1.0 + * (2) k = - 1 / (8 * write_bw) (in single wb case) + * or equally: x_intercept = wb_setpoint + 8 * write_bw * - * For single bdi case, the dirty pages are observed to fluctuate + * For single wb case, the dirty pages are observed to fluctuate * regularly within range - * [bdi_setpoint - write_bw/2, bdi_setpoint + write_bw/2] + * [wb_setpoint - write_bw/2, wb_setpoint + write_bw/2] * for various filesystems, where (2) can yield in a reasonable 12.5% * fluctuation range for pos_ratio. * - * For JBOD case, bdi_thresh (not bdi_dirty!) could fluctuate up to its + * For JBOD case, wb_thresh (not wb_dirty!) could fluctuate up to its * own size, so move the slope over accordingly and choose a slope that - * yields 100% pos_ratio fluctuation on suddenly doubled bdi_thresh. + * yields 100% pos_ratio fluctuation on suddenly doubled wb_thresh. */ - if (unlikely(bdi_thresh > thresh)) - bdi_thresh = thresh; + if (unlikely(wb_thresh > dtc->thresh)) + wb_thresh = dtc->thresh; /* - * It's very possible that bdi_thresh is close to 0 not because the + * It's very possible that wb_thresh is close to 0 not because the * device is slow, but that it has remained inactive for long time. * Honour such devices a reasonable good (hopefully IO efficient) * threshold, so that the occasional writes won't be blocked and active * writes can rampup the threshold quickly. */ - bdi_thresh = max(bdi_thresh, (limit - dirty) / 8); + wb_thresh = max(wb_thresh, (limit - dtc->dirty) / 8); /* - * scale global setpoint to bdi's: - * bdi_setpoint = setpoint * bdi_thresh / thresh + * scale global setpoint to wb's: + * wb_setpoint = setpoint * wb_thresh / thresh */ - x = div_u64((u64)bdi_thresh << 16, thresh | 1); - bdi_setpoint = setpoint * (u64)x >> 16; + x = div_u64((u64)wb_thresh << 16, dtc->thresh + 1); + wb_setpoint = setpoint * (u64)x >> 16; /* - * Use span=(8*write_bw) in single bdi case as indicated by - * (thresh - bdi_thresh ~= 0) and transit to bdi_thresh in JBOD case. + * Use span=(8*write_bw) in single wb case as indicated by + * (thresh - wb_thresh ~= 0) and transit to wb_thresh in JBOD case. * - * bdi_thresh thresh - bdi_thresh - * span = ---------- * (8 * write_bw) + ------------------- * bdi_thresh - * thresh thresh + * wb_thresh thresh - wb_thresh + * span = --------- * (8 * write_bw) + ------------------ * wb_thresh + * thresh thresh */ - span = (thresh - bdi_thresh + 8 * write_bw) * (u64)x >> 16; - x_intercept = bdi_setpoint + span; + span = (dtc->thresh - wb_thresh + 8 * write_bw) * (u64)x >> 16; + x_intercept = wb_setpoint + span; - if (bdi_dirty < x_intercept - span / 4) { - pos_ratio = div64_u64(pos_ratio * (x_intercept - bdi_dirty), - (x_intercept - bdi_setpoint) | 1); + if (dtc->wb_dirty < x_intercept - span / 4) { + pos_ratio = div64_u64(pos_ratio * (x_intercept - dtc->wb_dirty), + x_intercept - wb_setpoint + 1); } else pos_ratio /= 4; /* - * bdi reserve area, safeguard against dirty pool underrun and disk idle + * wb reserve area, safeguard against dirty pool underrun and disk idle * It may push the desired control point of global dirty pages higher * than setpoint. */ - x_intercept = bdi_thresh / 2; - if (bdi_dirty < x_intercept) { - if (bdi_dirty > x_intercept / 8) - pos_ratio = div_u64(pos_ratio * x_intercept, bdi_dirty); + x_intercept = wb_thresh / 2; + if (dtc->wb_dirty < x_intercept) { + if (dtc->wb_dirty > x_intercept / 8) + pos_ratio = div_u64(pos_ratio * x_intercept, + dtc->wb_dirty); else pos_ratio *= 8; } - return pos_ratio; + dtc->pos_ratio = pos_ratio; } -static void bdi_update_write_bandwidth(struct backing_dev_info *bdi, - unsigned long elapsed, - unsigned long written) +static void wb_update_write_bandwidth(struct bdi_writeback *wb, + unsigned long elapsed, + unsigned long written) { const unsigned long period = roundup_pow_of_two(3 * HZ); - unsigned long avg = bdi->avg_write_bandwidth; - unsigned long old = bdi->write_bandwidth; + unsigned long avg = wb->avg_write_bandwidth; + unsigned long old = wb->write_bandwidth; u64 bw; /* @@ -861,14 +1050,14 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi, * @written may have decreased due to account_page_redirty(). * Avoid underflowing @bw calculation. */ - bw = written - min(written, bdi->written_stamp); + bw = written - min(written, wb->written_stamp); bw *= HZ; if (unlikely(elapsed > period)) { do_div(bw, elapsed); avg = bw; goto out; } - bw += (u64)bdi->write_bandwidth * (period - elapsed); + bw += (u64)wb->write_bandwidth * (period - elapsed); bw >>= ilog2(period); /* @@ -881,21 +1070,22 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi, avg += (old - avg) >> 3; out: - bdi->write_bandwidth = bw; - bdi->avg_write_bandwidth = avg; + /* keep avg > 0 to guarantee that tot > 0 if there are dirty wbs */ + avg = max(avg, 1LU); + if (wb_has_dirty_io(wb)) { + long delta = avg - wb->avg_write_bandwidth; + WARN_ON_ONCE(atomic_long_add_return(delta, + &wb->bdi->tot_write_bandwidth) <= 0); + } + wb->write_bandwidth = bw; + wb->avg_write_bandwidth = avg; } -/* - * The global dirtyable memory and dirty threshold could be suddenly knocked - * down by a large amount (eg. on the startup of KVM in a swapless system). - * This may throw the system into deep dirty exceeded state and throttle - * heavy/light dirtiers alike. To retain good responsiveness, maintain - * global_dirty_limit for tracking slowly down to the knocked down dirty - * threshold. - */ -static void update_dirty_limit(unsigned long thresh, unsigned long dirty) +static void update_dirty_limit(struct dirty_throttle_control *dtc) { - unsigned long limit = global_dirty_limit; + struct wb_domain *dom = dtc_dom(dtc); + unsigned long thresh = dtc->thresh; + unsigned long limit = dom->dirty_limit; /* * Follow up in one step. @@ -908,63 +1098,57 @@ static void update_dirty_limit(unsigned long thresh, unsigned long dirty) /* * Follow down slowly. Use the higher one as the target, because thresh * may drop below dirty. This is exactly the reason to introduce - * global_dirty_limit which is guaranteed to lie above the dirty pages. + * dom->dirty_limit which is guaranteed to lie above the dirty pages. */ - thresh = max(thresh, dirty); + thresh = max(thresh, dtc->dirty); if (limit > thresh) { limit -= (limit - thresh) >> 5; goto update; } return; update: - global_dirty_limit = limit; + dom->dirty_limit = limit; } -static void global_update_bandwidth(unsigned long thresh, - unsigned long dirty, +static void domain_update_bandwidth(struct dirty_throttle_control *dtc, unsigned long now) { - static DEFINE_SPINLOCK(dirty_lock); - static unsigned long update_time = INITIAL_JIFFIES; + struct wb_domain *dom = dtc_dom(dtc); /* * check locklessly first to optimize away locking for the most time */ - if (time_before(now, update_time + BANDWIDTH_INTERVAL)) + if (time_before(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL)) return; - spin_lock(&dirty_lock); - if (time_after_eq(now, update_time + BANDWIDTH_INTERVAL)) { - update_dirty_limit(thresh, dirty); - update_time = now; + spin_lock(&dom->lock); + if (time_after_eq(now, dom->dirty_limit_tstamp + BANDWIDTH_INTERVAL)) { + update_dirty_limit(dtc); + dom->dirty_limit_tstamp = now; } - spin_unlock(&dirty_lock); + spin_unlock(&dom->lock); } /* - * Maintain bdi->dirty_ratelimit, the base dirty throttle rate. + * Maintain wb->dirty_ratelimit, the base dirty throttle rate. * - * Normal bdi tasks will be curbed at or below it in long term. + * Normal wb tasks will be curbed at or below it in long term. * Obviously it should be around (write_bw / N) when there are N dd tasks. */ -static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, - unsigned long dirtied, - unsigned long elapsed) -{ - unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh); - unsigned long limit = hard_dirty_limit(thresh); +static void wb_update_dirty_ratelimit(struct dirty_throttle_control *dtc, + unsigned long dirtied, + unsigned long elapsed) +{ + struct bdi_writeback *wb = dtc->wb; + unsigned long dirty = dtc->dirty; + unsigned long freerun = dirty_freerun_ceiling(dtc->thresh, dtc->bg_thresh); + unsigned long limit = hard_dirty_limit(dtc_dom(dtc), dtc->thresh); unsigned long setpoint = (freerun + limit) / 2; - unsigned long write_bw = bdi->avg_write_bandwidth; - unsigned long dirty_ratelimit = bdi->dirty_ratelimit; + unsigned long write_bw = wb->avg_write_bandwidth; + unsigned long dirty_ratelimit = wb->dirty_ratelimit; unsigned long dirty_rate; unsigned long task_ratelimit; unsigned long balanced_dirty_ratelimit; - unsigned long pos_ratio; unsigned long step; unsigned long x; @@ -972,20 +1156,18 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, * The dirty rate will match the writeout rate in long term, except * when dirty pages are truncated by userspace or re-dirtied by FS. */ - dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed; + dirty_rate = (dirtied - wb->dirtied_stamp) * HZ / elapsed; - pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty); /* * task_ratelimit reflects each dd's dirty rate for the past 200ms. */ task_ratelimit = (u64)dirty_ratelimit * - pos_ratio >> RATELIMIT_CALC_SHIFT; + dtc->pos_ratio >> RATELIMIT_CALC_SHIFT; task_ratelimit++; /* it helps rampup dirty_ratelimit from tiny values */ /* * A linear estimation of the "balanced" throttle rate. The theory is, - * if there are N dd tasks, each throttled at task_ratelimit, the bdi's + * if there are N dd tasks, each throttled at task_ratelimit, the wb's * dirty_rate will be measured to be (N * task_ratelimit). So the below * formula will yield the balanced rate limit (write_bw / N). * @@ -1024,7 +1206,7 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, /* * We could safely do this and return immediately: * - * bdi->dirty_ratelimit = balanced_dirty_ratelimit; + * wb->dirty_ratelimit = balanced_dirty_ratelimit; * * However to get a more stable dirty_ratelimit, the below elaborated * code makes use of task_ratelimit to filter out singular points and @@ -1058,32 +1240,31 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, step = 0; /* - * For strictlimit case, calculations above were based on bdi counters - * and limits (starting from pos_ratio = bdi_position_ratio() and up to + * For strictlimit case, calculations above were based on wb counters + * and limits (starting from pos_ratio = wb_position_ratio() and up to * balanced_dirty_ratelimit = task_ratelimit * write_bw / dirty_rate). - * Hence, to calculate "step" properly, we have to use bdi_dirty as - * "dirty" and bdi_setpoint as "setpoint". + * Hence, to calculate "step" properly, we have to use wb_dirty as + * "dirty" and wb_setpoint as "setpoint". * - * We rampup dirty_ratelimit forcibly if bdi_dirty is low because - * it's possible that bdi_thresh is close to zero due to inactivity - * of backing device (see the implementation of bdi_dirty_limit()). + * We rampup dirty_ratelimit forcibly if wb_dirty is low because + * it's possible that wb_thresh is close to zero due to inactivity + * of backing device. */ - if (unlikely(bdi->capabilities & BDI_CAP_STRICTLIMIT)) { - dirty = bdi_dirty; - if (bdi_dirty < 8) - setpoint = bdi_dirty + 1; + if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) { + dirty = dtc->wb_dirty; + if (dtc->wb_dirty < 8) + setpoint = dtc->wb_dirty + 1; else - setpoint = (bdi_thresh + - bdi_dirty_limit(bdi, bg_thresh)) / 2; + setpoint = (dtc->wb_thresh + dtc->wb_bg_thresh) / 2; } if (dirty < setpoint) { - x = min3(bdi->balanced_dirty_ratelimit, + x = min3(wb->balanced_dirty_ratelimit, balanced_dirty_ratelimit, task_ratelimit); if (dirty_ratelimit < x) step = x - dirty_ratelimit; } else { - x = max3(bdi->balanced_dirty_ratelimit, + x = max3(wb->balanced_dirty_ratelimit, balanced_dirty_ratelimit, task_ratelimit); if (dirty_ratelimit > x) step = dirty_ratelimit - x; @@ -1105,69 +1286,67 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi, else dirty_ratelimit -= step; - bdi->dirty_ratelimit = max(dirty_ratelimit, 1UL); - bdi->balanced_dirty_ratelimit = balanced_dirty_ratelimit; + wb->dirty_ratelimit = max(dirty_ratelimit, 1UL); + wb->balanced_dirty_ratelimit = balanced_dirty_ratelimit; - trace_bdi_dirty_ratelimit(bdi, dirty_rate, task_ratelimit); + trace_bdi_dirty_ratelimit(wb->bdi, dirty_rate, task_ratelimit); } -void __bdi_update_bandwidth(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, - unsigned long start_time) +static void __wb_update_bandwidth(struct dirty_throttle_control *gdtc, + struct dirty_throttle_control *mdtc, + unsigned long start_time, + bool update_ratelimit) { + struct bdi_writeback *wb = gdtc->wb; unsigned long now = jiffies; - unsigned long elapsed = now - bdi->bw_time_stamp; + unsigned long elapsed = now - wb->bw_time_stamp; unsigned long dirtied; unsigned long written; + lockdep_assert_held(&wb->list_lock); + /* * rate-limit, only update once every 200ms. */ if (elapsed < BANDWIDTH_INTERVAL) return; - dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]); - written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]); + dirtied = percpu_counter_read(&wb->stat[WB_DIRTIED]); + written = percpu_counter_read(&wb->stat[WB_WRITTEN]); /* * Skip quiet periods when disk bandwidth is under-utilized. * (at least 1s idle time between two flusher runs) */ - if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time)) + if (elapsed > HZ && time_before(wb->bw_time_stamp, start_time)) goto snapshot; - if (thresh) { - global_update_bandwidth(thresh, dirty, now); - bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty, - dirtied, elapsed); + if (update_ratelimit) { + domain_update_bandwidth(gdtc, now); + wb_update_dirty_ratelimit(gdtc, dirtied, elapsed); + + /* + * @mdtc is always NULL if !CGROUP_WRITEBACK but the + * compiler has no way to figure that out. Help it. + */ + if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) { + domain_update_bandwidth(mdtc, now); + wb_update_dirty_ratelimit(mdtc, dirtied, elapsed); + } } - bdi_update_write_bandwidth(bdi, elapsed, written); + wb_update_write_bandwidth(wb, elapsed, written); snapshot: - bdi->dirtied_stamp = dirtied; - bdi->written_stamp = written; - bdi->bw_time_stamp = now; + wb->dirtied_stamp = dirtied; + wb->written_stamp = written; + wb->bw_time_stamp = now; } -static void bdi_update_bandwidth(struct backing_dev_info *bdi, - unsigned long thresh, - unsigned long bg_thresh, - unsigned long dirty, - unsigned long bdi_thresh, - unsigned long bdi_dirty, - unsigned long start_time) +void wb_update_bandwidth(struct bdi_writeback *wb, unsigned long start_time) { - if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL)) - return; - spin_lock(&bdi->wb.list_lock); - __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty, - bdi_thresh, bdi_dirty, start_time); - spin_unlock(&bdi->wb.list_lock); + struct dirty_throttle_control gdtc = { GDTC_INIT(wb) }; + + __wb_update_bandwidth(&gdtc, NULL, start_time, false); } /* @@ -1187,10 +1366,10 @@ static unsigned long dirty_poll_interval(unsigned long dirty, return 1; } -static unsigned long bdi_max_pause(struct backing_dev_info *bdi, - unsigned long bdi_dirty) +static unsigned long wb_max_pause(struct bdi_writeback *wb, + unsigned long wb_dirty) { - unsigned long bw = bdi->avg_write_bandwidth; + unsigned long bw = wb->avg_write_bandwidth; unsigned long t; /* @@ -1200,20 +1379,20 @@ static unsigned long bdi_max_pause(struct backing_dev_info *bdi, * * 8 serves as the safety ratio. */ - t = bdi_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8)); + t = wb_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8)); t++; return min_t(unsigned long, t, MAX_PAUSE); } -static long bdi_min_pause(struct backing_dev_info *bdi, - long max_pause, - unsigned long task_ratelimit, - unsigned long dirty_ratelimit, - int *nr_dirtied_pause) +static long wb_min_pause(struct bdi_writeback *wb, + long max_pause, + unsigned long task_ratelimit, + unsigned long dirty_ratelimit, + int *nr_dirtied_pause) { - long hi = ilog2(bdi->avg_write_bandwidth); - long lo = ilog2(bdi->dirty_ratelimit); + long hi = ilog2(wb->avg_write_bandwidth); + long lo = ilog2(wb->dirty_ratelimit); long t; /* target pause */ long pause; /* estimated next pause */ int pages; /* target nr_dirtied_pause */ @@ -1281,34 +1460,27 @@ static long bdi_min_pause(struct backing_dev_info *bdi, return pages >= DIRTY_POLL_THRESH ? 1 + t / 2 : t; } -static inline void bdi_dirty_limits(struct backing_dev_info *bdi, - unsigned long dirty_thresh, - unsigned long background_thresh, - unsigned long *bdi_dirty, - unsigned long *bdi_thresh, - unsigned long *bdi_bg_thresh) +static inline void wb_dirty_limits(struct dirty_throttle_control *dtc) { - unsigned long bdi_reclaimable; + struct bdi_writeback *wb = dtc->wb; + unsigned long wb_reclaimable; /* - * bdi_thresh is not treated as some limiting factor as + * wb_thresh is not treated as some limiting factor as * dirty_thresh, due to reasons - * - in JBOD setup, bdi_thresh can fluctuate a lot + * - in JBOD setup, wb_thresh can fluctuate a lot * - in a system with HDD and USB key, the USB key may somehow - * go into state (bdi_dirty >> bdi_thresh) either because - * bdi_dirty starts high, or because bdi_thresh drops low. + * go into state (wb_dirty >> wb_thresh) either because + * wb_dirty starts high, or because wb_thresh drops low. * In this case we don't want to hard throttle the USB key - * dirtiers for 100 seconds until bdi_dirty drops under - * bdi_thresh. Instead the auxiliary bdi control line in - * bdi_position_ratio() will let the dirtier task progress - * at some rate <= (write_bw / 2) for bringing down bdi_dirty. + * dirtiers for 100 seconds until wb_dirty drops under + * wb_thresh. Instead the auxiliary wb control line in + * wb_position_ratio() will let the dirtier task progress + * at some rate <= (write_bw / 2) for bringing down wb_dirty. */ - *bdi_thresh = bdi_dirty_limit(bdi, dirty_thresh); - - if (bdi_bg_thresh) - *bdi_bg_thresh = dirty_thresh ? div_u64((u64)*bdi_thresh * - background_thresh, - dirty_thresh) : 0; + dtc->wb_thresh = __wb_calc_thresh(dtc); + dtc->wb_bg_thresh = dtc->thresh ? + div_u64((u64)dtc->wb_thresh * dtc->bg_thresh, dtc->thresh) : 0; /* * In order to avoid the stacked BDI deadlock we need @@ -1320,14 +1492,12 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, * actually dirty; with m+n sitting in the percpu * deltas. */ - if (*bdi_thresh < 2 * bdi_stat_error(bdi)) { - bdi_reclaimable = bdi_stat_sum(bdi, BDI_RECLAIMABLE); - *bdi_dirty = bdi_reclaimable + - bdi_stat_sum(bdi, BDI_WRITEBACK); + if (dtc->wb_thresh < 2 * wb_stat_error(wb)) { + wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); + dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); } else { - bdi_reclaimable = bdi_stat(bdi, BDI_RECLAIMABLE); - *bdi_dirty = bdi_reclaimable + - bdi_stat(bdi, BDI_WRITEBACK); + wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE); + dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); } } @@ -1339,12 +1509,16 @@ static inline void bdi_dirty_limits(struct backing_dev_info *bdi, * perform some writeout. */ static void balance_dirty_pages(struct address_space *mapping, + struct bdi_writeback *wb, unsigned long pages_dirtied) { + struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) }; + struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) }; + struct dirty_throttle_control * const gdtc = &gdtc_stor; + struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? + &mdtc_stor : NULL; + struct dirty_throttle_control *sdtc; unsigned long nr_reclaimable; /* = file_dirty + unstable_nfs */ - unsigned long nr_dirty; /* = file_dirty + writeback + unstable_nfs */ - unsigned long background_thresh; - unsigned long dirty_thresh; long period; long pause; long max_pause; @@ -1353,18 +1527,14 @@ static void balance_dirty_pages(struct address_space *mapping, bool dirty_exceeded = false; unsigned long task_ratelimit; unsigned long dirty_ratelimit; - unsigned long pos_ratio; - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct backing_dev_info *bdi = wb->bdi; bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; unsigned long start_time = jiffies; for (;;) { unsigned long now = jiffies; - unsigned long uninitialized_var(bdi_thresh); - unsigned long thresh; - unsigned long uninitialized_var(bdi_dirty); - unsigned long dirty; - unsigned long bg_thresh; + unsigned long dirty, thresh, bg_thresh; + unsigned long m_dirty, m_thresh, m_bg_thresh; /* * Unstable writes are a feature of certain networked @@ -1374,65 +1544,127 @@ static void balance_dirty_pages(struct address_space *mapping, */ nr_reclaimable = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); - nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); + gdtc->avail = global_dirtyable_memory(); + gdtc->dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); - global_dirty_limits(&background_thresh, &dirty_thresh); + domain_dirty_limits(gdtc); if (unlikely(strictlimit)) { - bdi_dirty_limits(bdi, dirty_thresh, background_thresh, - &bdi_dirty, &bdi_thresh, &bg_thresh); + wb_dirty_limits(gdtc); - dirty = bdi_dirty; - thresh = bdi_thresh; + dirty = gdtc->wb_dirty; + thresh = gdtc->wb_thresh; + bg_thresh = gdtc->wb_bg_thresh; } else { - dirty = nr_dirty; - thresh = dirty_thresh; - bg_thresh = background_thresh; + dirty = gdtc->dirty; + thresh = gdtc->thresh; + bg_thresh = gdtc->bg_thresh; + } + + if (mdtc) { + unsigned long writeback; + + /* + * If @wb belongs to !root memcg, repeat the same + * basic calculations for the memcg domain. + */ + mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, + &writeback); + mdtc_cap_avail(mdtc); + mdtc->dirty += writeback; + + domain_dirty_limits(mdtc); + + if (unlikely(strictlimit)) { + wb_dirty_limits(mdtc); + m_dirty = mdtc->wb_dirty; + m_thresh = mdtc->wb_thresh; + m_bg_thresh = mdtc->wb_bg_thresh; + } else { + m_dirty = mdtc->dirty; + m_thresh = mdtc->thresh; + m_bg_thresh = mdtc->bg_thresh; + } } /* * Throttle it only when the background writeback cannot * catch-up. This avoids (excessively) small writeouts - * when the bdi limits are ramping up in case of !strictlimit. + * when the wb limits are ramping up in case of !strictlimit. * - * In strictlimit case make decision based on the bdi counters - * and limits. Small writeouts when the bdi limits are ramping + * In strictlimit case make decision based on the wb counters + * and limits. Small writeouts when the wb limits are ramping * up are the price we consciously pay for strictlimit-ing. + * + * If memcg domain is in effect, @dirty should be under + * both global and memcg freerun ceilings. */ - if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) { + if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && + (!mdtc || + m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { + unsigned long intv = dirty_poll_interval(dirty, thresh); + unsigned long m_intv = ULONG_MAX; + current->dirty_paused_when = now; current->nr_dirtied = 0; - current->nr_dirtied_pause = - dirty_poll_interval(dirty, thresh); + if (mdtc) + m_intv = dirty_poll_interval(m_dirty, m_thresh); + current->nr_dirtied_pause = min(intv, m_intv); break; } - if (unlikely(!writeback_in_progress(bdi))) - bdi_start_background_writeback(bdi); + if (unlikely(!writeback_in_progress(wb))) + wb_start_background_writeback(wb); + /* + * Calculate global domain's pos_ratio and select the + * global dtc by default. + */ if (!strictlimit) - bdi_dirty_limits(bdi, dirty_thresh, background_thresh, - &bdi_dirty, &bdi_thresh, NULL); - - dirty_exceeded = (bdi_dirty > bdi_thresh) && - ((nr_dirty > dirty_thresh) || strictlimit); - if (dirty_exceeded && !bdi->dirty_exceeded) - bdi->dirty_exceeded = 1; - - bdi_update_bandwidth(bdi, dirty_thresh, background_thresh, - nr_dirty, bdi_thresh, bdi_dirty, - start_time); - - dirty_ratelimit = bdi->dirty_ratelimit; - pos_ratio = bdi_position_ratio(bdi, dirty_thresh, - background_thresh, nr_dirty, - bdi_thresh, bdi_dirty); - task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >> + wb_dirty_limits(gdtc); + + dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && + ((gdtc->dirty > gdtc->thresh) || strictlimit); + + wb_position_ratio(gdtc); + sdtc = gdtc; + + if (mdtc) { + /* + * If memcg domain is in effect, calculate its + * pos_ratio. @wb should satisfy constraints from + * both global and memcg domains. Choose the one + * w/ lower pos_ratio. + */ + if (!strictlimit) + wb_dirty_limits(mdtc); + + dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && + ((mdtc->dirty > mdtc->thresh) || strictlimit); + + wb_position_ratio(mdtc); + if (mdtc->pos_ratio < gdtc->pos_ratio) + sdtc = mdtc; + } + + if (dirty_exceeded && !wb->dirty_exceeded) + wb->dirty_exceeded = 1; + + if (time_is_before_jiffies(wb->bw_time_stamp + + BANDWIDTH_INTERVAL)) { + spin_lock(&wb->list_lock); + __wb_update_bandwidth(gdtc, mdtc, start_time, true); + spin_unlock(&wb->list_lock); + } + + /* throttle according to the chosen dtc */ + dirty_ratelimit = wb->dirty_ratelimit; + task_ratelimit = ((u64)dirty_ratelimit * sdtc->pos_ratio) >> RATELIMIT_CALC_SHIFT; - max_pause = bdi_max_pause(bdi, bdi_dirty); - min_pause = bdi_min_pause(bdi, max_pause, - task_ratelimit, dirty_ratelimit, - &nr_dirtied_pause); + max_pause = wb_max_pause(wb, sdtc->wb_dirty); + min_pause = wb_min_pause(wb, max_pause, + task_ratelimit, dirty_ratelimit, + &nr_dirtied_pause); if (unlikely(task_ratelimit == 0)) { period = max_pause; @@ -1452,11 +1684,11 @@ static void balance_dirty_pages(struct address_space *mapping, */ if (pause < min_pause) { trace_balance_dirty_pages(bdi, - dirty_thresh, - background_thresh, - nr_dirty, - bdi_thresh, - bdi_dirty, + sdtc->thresh, + sdtc->bg_thresh, + sdtc->dirty, + sdtc->wb_thresh, + sdtc->wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, @@ -1481,11 +1713,11 @@ static void balance_dirty_pages(struct address_space *mapping, pause: trace_balance_dirty_pages(bdi, - dirty_thresh, - background_thresh, - nr_dirty, - bdi_thresh, - bdi_dirty, + sdtc->thresh, + sdtc->bg_thresh, + sdtc->dirty, + sdtc->wb_thresh, + sdtc->wb_dirty, dirty_ratelimit, task_ratelimit, pages_dirtied, @@ -1500,33 +1732,33 @@ pause: current->nr_dirtied_pause = nr_dirtied_pause; /* - * This is typically equal to (nr_dirty < dirty_thresh) and can - * also keep "1000+ dd on a slow USB stick" under control. + * This is typically equal to (dirty < thresh) and can also + * keep "1000+ dd on a slow USB stick" under control. */ if (task_ratelimit) break; /* * In the case of an unresponding NFS server and the NFS dirty - * pages exceeds dirty_thresh, give the other good bdi's a pipe + * pages exceeds dirty_thresh, give the other good wb's a pipe * to go through, so that tasks on them still remain responsive. * * In theory 1 page is enough to keep the comsumer-producer * pipe going: the flusher cleans 1 page => the task dirties 1 - * more page. However bdi_dirty has accounting errors. So use - * the larger and more IO friendly bdi_stat_error. + * more page. However wb_dirty has accounting errors. So use + * the larger and more IO friendly wb_stat_error. */ - if (bdi_dirty <= bdi_stat_error(bdi)) + if (sdtc->wb_dirty <= wb_stat_error(wb)) break; if (fatal_signal_pending(current)) break; } - if (!dirty_exceeded && bdi->dirty_exceeded) - bdi->dirty_exceeded = 0; + if (!dirty_exceeded && wb->dirty_exceeded) + wb->dirty_exceeded = 0; - if (writeback_in_progress(bdi)) + if (writeback_in_progress(wb)) return; /* @@ -1540,8 +1772,8 @@ pause: if (laptop_mode) return; - if (nr_reclaimable > background_thresh) - bdi_start_background_writeback(bdi); + if (nr_reclaimable > gdtc->bg_thresh) + wb_start_background_writeback(wb); } static DEFINE_PER_CPU(int, bdp_ratelimits); @@ -1577,15 +1809,22 @@ DEFINE_PER_CPU(int, dirty_throttle_leaks) = 0; */ void balance_dirty_pages_ratelimited(struct address_space *mapping) { - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct inode *inode = mapping->host; + struct backing_dev_info *bdi = inode_to_bdi(inode); + struct bdi_writeback *wb = NULL; int ratelimit; int *p; if (!bdi_cap_account_dirty(bdi)) return; + if (inode_cgwb_enabled(inode)) + wb = wb_get_create_current(bdi, GFP_KERNEL); + if (!wb) + wb = &bdi->wb; + ratelimit = current->nr_dirtied_pause; - if (bdi->dirty_exceeded) + if (wb->dirty_exceeded) ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10)); preempt_disable(); @@ -1617,10 +1856,59 @@ void balance_dirty_pages_ratelimited(struct address_space *mapping) preempt_enable(); if (unlikely(current->nr_dirtied >= ratelimit)) - balance_dirty_pages(mapping, current->nr_dirtied); + balance_dirty_pages(mapping, wb, current->nr_dirtied); + + wb_put(wb); } EXPORT_SYMBOL(balance_dirty_pages_ratelimited); +/** + * wb_over_bg_thresh - does @wb need to be written back? + * @wb: bdi_writeback of interest + * + * Determines whether background writeback should keep writing @wb or it's + * clean enough. Returns %true if writeback should continue. + */ +bool wb_over_bg_thresh(struct bdi_writeback *wb) +{ + struct dirty_throttle_control gdtc_stor = { GDTC_INIT(wb) }; + struct dirty_throttle_control mdtc_stor = { MDTC_INIT(wb, &gdtc_stor) }; + struct dirty_throttle_control * const gdtc = &gdtc_stor; + struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ? + &mdtc_stor : NULL; + + /* + * Similar to balance_dirty_pages() but ignores pages being written + * as we're trying to decide whether to put more under writeback. + */ + gdtc->avail = global_dirtyable_memory(); + gdtc->dirty = global_page_state(NR_FILE_DIRTY) + + global_page_state(NR_UNSTABLE_NFS); + domain_dirty_limits(gdtc); + + if (gdtc->dirty > gdtc->bg_thresh) + return true; + + if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(gdtc)) + return true; + + if (mdtc) { + unsigned long writeback; + + mem_cgroup_wb_stats(wb, &mdtc->avail, &mdtc->dirty, &writeback); + mdtc_cap_avail(mdtc); + domain_dirty_limits(mdtc); /* ditto, ignore writeback */ + + if (mdtc->dirty > mdtc->bg_thresh) + return true; + + if (wb_stat(wb, WB_RECLAIMABLE) > __wb_calc_thresh(mdtc)) + return true; + } + + return false; +} + void throttle_vm_writeout(gfp_t gfp_mask) { unsigned long background_thresh; @@ -1628,7 +1916,7 @@ void throttle_vm_writeout(gfp_t gfp_mask) for ( ; ; ) { global_dirty_limits(&background_thresh, &dirty_thresh); - dirty_thresh = hard_dirty_limit(dirty_thresh); + dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh); /* * Boost the allowable dirty threshold a bit for page @@ -1667,14 +1955,20 @@ void laptop_mode_timer_fn(unsigned long data) struct request_queue *q = (struct request_queue *)data; int nr_pages = global_page_state(NR_FILE_DIRTY) + global_page_state(NR_UNSTABLE_NFS); + struct bdi_writeback *wb; + struct wb_iter iter; /* * We want to write everything out, not just down to the dirty * threshold */ - if (bdi_has_dirty_io(&q->backing_dev_info)) - bdi_start_writeback(&q->backing_dev_info, nr_pages, - WB_REASON_LAPTOP_TIMER); + if (!bdi_has_dirty_io(&q->backing_dev_info)) + return; + + bdi_for_each_wb(wb, &q->backing_dev_info, &iter, 0) + if (wb_has_dirty_io(wb)) + wb_start_writeback(wb, nr_pages, true, + WB_REASON_LAPTOP_TIMER); } /* @@ -1718,10 +2012,12 @@ void laptop_sync_completion(void) void writeback_set_ratelimit(void) { + struct wb_domain *dom = &global_wb_domain; unsigned long background_thresh; unsigned long dirty_thresh; + global_dirty_limits(&background_thresh, &dirty_thresh); - global_dirty_limit = dirty_thresh; + dom->dirty_limit = dirty_thresh; ratelimit_pages = dirty_thresh / (num_online_cpus() * 32); if (ratelimit_pages < 16) ratelimit_pages = 16; @@ -1770,7 +2066,7 @@ void __init page_writeback_init(void) writeback_set_ratelimit(); register_cpu_notifier(&ratelimit_nb); - fprop_global_init(&writeout_completions, GFP_KERNEL); + BUG_ON(wb_domain_init(&global_wb_domain, GFP_KERNEL)); } /** @@ -2090,19 +2386,29 @@ int __set_page_dirty_no_writeback(struct page *page) /* * Helper function for set_page_dirty family. + * + * Caller must hold mem_cgroup_begin_page_stat(). + * * NOTE: This relies on being atomic wrt interrupts. */ -void account_page_dirtied(struct page *page, struct address_space *mapping) +void account_page_dirtied(struct page *page, struct address_space *mapping, + struct mem_cgroup *memcg) { + struct inode *inode = mapping->host; + trace_writeback_dirty_page(page, mapping); if (mapping_cap_account_dirty(mapping)) { - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct bdi_writeback *wb; + inode_attach_wb(inode, page); + wb = inode_to_wb(inode); + + mem_cgroup_inc_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_zone_page_state(page, NR_DIRTIED); - __inc_bdi_stat(bdi, BDI_RECLAIMABLE); - __inc_bdi_stat(bdi, BDI_DIRTIED); + __inc_wb_stat(wb, WB_RECLAIMABLE); + __inc_wb_stat(wb, WB_DIRTIED); task_io_account_write(PAGE_CACHE_SIZE); current->nr_dirtied++; this_cpu_inc(bdp_ratelimits); @@ -2113,21 +2419,18 @@ EXPORT_SYMBOL(account_page_dirtied); /* * Helper function for deaccounting dirty page without writeback. * - * Doing this should *normally* only ever be done when a page - * is truncated, and is not actually mapped anywhere at all. However, - * fs/buffer.c does this when it notices that somebody has cleaned - * out all the buffers on a page without actually doing it through - * the VM. Can you say "ext3 is horribly ugly"? Thought you could. + * Caller must hold mem_cgroup_begin_page_stat(). */ -void account_page_cleaned(struct page *page, struct address_space *mapping) +void account_page_cleaned(struct page *page, struct address_space *mapping, + struct mem_cgroup *memcg, struct bdi_writeback *wb) { if (mapping_cap_account_dirty(mapping)) { + mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(inode_to_bdi(mapping->host), BDI_RECLAIMABLE); + dec_wb_stat(wb, WB_RECLAIMABLE); task_io_account_cancelled_write(PAGE_CACHE_SIZE); } } -EXPORT_SYMBOL(account_page_cleaned); /* * For address_spaces which do not use buffers. Just tag the page as dirty in @@ -2143,26 +2446,34 @@ EXPORT_SYMBOL(account_page_cleaned); */ int __set_page_dirty_nobuffers(struct page *page) { + struct mem_cgroup *memcg; + + memcg = mem_cgroup_begin_page_stat(page); if (!TestSetPageDirty(page)) { struct address_space *mapping = page_mapping(page); unsigned long flags; - if (!mapping) + if (!mapping) { + mem_cgroup_end_page_stat(memcg); return 1; + } spin_lock_irqsave(&mapping->tree_lock, flags); BUG_ON(page_mapping(page) != mapping); WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page)); - account_page_dirtied(page, mapping); + account_page_dirtied(page, mapping, memcg); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); + if (mapping->host) { /* !PageAnon && !swapper_space */ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } return 1; } + mem_cgroup_end_page_stat(memcg); return 0; } EXPORT_SYMBOL(__set_page_dirty_nobuffers); @@ -2177,10 +2488,17 @@ EXPORT_SYMBOL(__set_page_dirty_nobuffers); void account_page_redirty(struct page *page) { struct address_space *mapping = page->mapping; + if (mapping && mapping_cap_account_dirty(mapping)) { + struct inode *inode = mapping->host; + struct bdi_writeback *wb; + bool locked; + + wb = unlocked_inode_to_wb_begin(inode, &locked); current->nr_dirtied--; dec_zone_page_state(page, NR_DIRTIED); - dec_bdi_stat(inode_to_bdi(mapping->host), BDI_DIRTIED); + dec_wb_stat(wb, WB_DIRTIED); + unlocked_inode_to_wb_end(inode, locked); } } EXPORT_SYMBOL(account_page_redirty); @@ -2266,6 +2584,43 @@ int set_page_dirty_lock(struct page *page) EXPORT_SYMBOL(set_page_dirty_lock); /* + * This cancels just the dirty bit on the kernel page itself, it does NOT + * actually remove dirty bits on any mmap's that may be around. It also + * leaves the page tagged dirty, so any sync activity will still find it on + * the dirty lists, and in particular, clear_page_dirty_for_io() will still + * look at the dirty bits in the VM. + * + * Doing this should *normally* only ever be done when a page is truncated, + * and is not actually mapped anywhere at all. However, fs/buffer.c does + * this when it notices that somebody has cleaned out all the buffers on a + * page without actually doing it through the VM. Can you say "ext3 is + * horribly ugly"? Thought you could. + */ +void cancel_dirty_page(struct page *page) +{ + struct address_space *mapping = page_mapping(page); + + if (mapping_cap_account_dirty(mapping)) { + struct inode *inode = mapping->host; + struct bdi_writeback *wb; + struct mem_cgroup *memcg; + bool locked; + + memcg = mem_cgroup_begin_page_stat(page); + wb = unlocked_inode_to_wb_begin(inode, &locked); + + if (TestClearPageDirty(page)) + account_page_cleaned(page, mapping, memcg, wb); + + unlocked_inode_to_wb_end(inode, locked); + mem_cgroup_end_page_stat(memcg); + } else { + ClearPageDirty(page); + } +} +EXPORT_SYMBOL(cancel_dirty_page); + +/* * Clear a page's dirty flag, while caring for dirty memory accounting. * Returns true if the page was previously dirty. * @@ -2282,10 +2637,16 @@ EXPORT_SYMBOL(set_page_dirty_lock); int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); + int ret = 0; BUG_ON(!PageLocked(page)); if (mapping && mapping_cap_account_dirty(mapping)) { + struct inode *inode = mapping->host; + struct bdi_writeback *wb; + struct mem_cgroup *memcg; + bool locked; + /* * Yes, Virginia, this is indeed insane. * @@ -2321,13 +2682,17 @@ int clear_page_dirty_for_io(struct page *page) * always locked coming in here, so we get the desired * exclusion. */ + memcg = mem_cgroup_begin_page_stat(page); + wb = unlocked_inode_to_wb_begin(inode, &locked); if (TestClearPageDirty(page)) { + mem_cgroup_dec_page_stat(memcg, MEM_CGROUP_STAT_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); - dec_bdi_stat(inode_to_bdi(mapping->host), - BDI_RECLAIMABLE); - return 1; + dec_wb_stat(wb, WB_RECLAIMABLE); + ret = 1; } - return 0; + unlocked_inode_to_wb_end(inode, locked); + mem_cgroup_end_page_stat(memcg); + return ret; } return TestClearPageDirty(page); } @@ -2341,7 +2706,8 @@ int test_clear_page_writeback(struct page *page) memcg = mem_cgroup_begin_page_stat(page); if (mapping) { - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct inode *inode = mapping->host; + struct backing_dev_info *bdi = inode_to_bdi(inode); unsigned long flags; spin_lock_irqsave(&mapping->tree_lock, flags); @@ -2351,8 +2717,10 @@ int test_clear_page_writeback(struct page *page) page_index(page), PAGECACHE_TAG_WRITEBACK); if (bdi_cap_account_writeback(bdi)) { - __dec_bdi_stat(bdi, BDI_WRITEBACK); - __bdi_writeout_inc(bdi); + struct bdi_writeback *wb = inode_to_wb(inode); + + __dec_wb_stat(wb, WB_WRITEBACK); + __wb_writeout_inc(wb); } } spin_unlock_irqrestore(&mapping->tree_lock, flags); @@ -2376,7 +2744,8 @@ int __test_set_page_writeback(struct page *page, bool keep_write) memcg = mem_cgroup_begin_page_stat(page); if (mapping) { - struct backing_dev_info *bdi = inode_to_bdi(mapping->host); + struct inode *inode = mapping->host; + struct backing_dev_info *bdi = inode_to_bdi(inode); unsigned long flags; spin_lock_irqsave(&mapping->tree_lock, flags); @@ -2386,7 +2755,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write) page_index(page), PAGECACHE_TAG_WRITEBACK); if (bdi_cap_account_writeback(bdi)) - __inc_bdi_stat(bdi, BDI_WRITEBACK); + __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK); } if (!PageDirty(page)) radix_tree_tag_clear(&mapping->page_tree, diff --git a/mm/readahead.c b/mm/readahead.c index 935675844b2e..60cd846a9a44 100644 --- a/mm/readahead.c +++ b/mm/readahead.c @@ -541,7 +541,7 @@ page_cache_async_readahead(struct address_space *mapping, /* * Defer asynchronous read-ahead on IO congestion. */ - if (bdi_read_congested(inode_to_bdi(mapping->host))) + if (inode_read_congested(mapping->host)) return; /* do read-ahead */ diff --git a/mm/rmap.c b/mm/rmap.c index 24dd3f9fee27..8fc556ce2dcb 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -30,6 +30,8 @@ * swap_lock (in swap_duplicate, swap_info_get) * mmlist_lock (in mmput, drain_mmlist and others) * mapping->private_lock (in __set_page_dirty_buffers) + * mem_cgroup_{begin,end}_page_stat (memcg->move_lock) + * mapping->tree_lock (widely used) * inode->i_lock (in set_page_dirty's __mark_inode_dirty) * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) * sb_lock (within inode_lock in fs/fs-writeback.c) diff --git a/mm/truncate.c b/mm/truncate.c index 66af9031fae8..76e35ad97102 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -116,9 +116,7 @@ truncate_complete_page(struct address_space *mapping, struct page *page) * the VM has canceled the dirty bit (eg ext3 journaling). * Hence dirty accounting check is placed after invalidation. */ - if (TestClearPageDirty(page)) - account_page_cleaned(page, mapping); - + cancel_dirty_page(page); ClearPageMappedToDisk(page); delete_from_page_cache(page); return 0; @@ -512,19 +510,24 @@ EXPORT_SYMBOL(invalidate_mapping_pages); static int invalidate_complete_page2(struct address_space *mapping, struct page *page) { + struct mem_cgroup *memcg; + unsigned long flags; + if (page->mapping != mapping) return 0; if (page_has_private(page) && !try_to_release_page(page, GFP_KERNEL)) return 0; - spin_lock_irq(&mapping->tree_lock); + memcg = mem_cgroup_begin_page_stat(page); + spin_lock_irqsave(&mapping->tree_lock, flags); if (PageDirty(page)) goto failed; BUG_ON(page_has_private(page)); - __delete_from_page_cache(page, NULL); - spin_unlock_irq(&mapping->tree_lock); + __delete_from_page_cache(page, NULL, memcg); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); if (mapping->a_ops->freepage) mapping->a_ops->freepage(page); @@ -532,7 +535,8 @@ invalidate_complete_page2(struct address_space *mapping, struct page *page) page_cache_release(page); /* pagecache ref */ return 1; failed: - spin_unlock_irq(&mapping->tree_lock); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); return 0; } diff --git a/mm/vmscan.c b/mm/vmscan.c index 1a17bd7c0ce5..b0c61df92b81 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -154,11 +154,42 @@ static bool global_reclaim(struct scan_control *sc) { return !sc->target_mem_cgroup; } + +/** + * sane_reclaim - is the usual dirty throttling mechanism operational? + * @sc: scan_control in question + * + * The normal page dirty throttling mechanism in balance_dirty_pages() is + * completely broken with the legacy memcg and direct stalling in + * shrink_page_list() is used for throttling instead, which lacks all the + * niceties such as fairness, adaptive pausing, bandwidth proportional + * allocation and configurability. + * + * This function tests whether the vmscan currently in progress can assume + * that the normal dirty throttling mechanism is operational. + */ +static bool sane_reclaim(struct scan_control *sc) +{ + struct mem_cgroup *memcg = sc->target_mem_cgroup; + + if (!memcg) + return true; +#ifdef CONFIG_CGROUP_WRITEBACK + if (cgroup_on_dfl(mem_cgroup_css(memcg)->cgroup)) + return true; +#endif + return false; +} #else static bool global_reclaim(struct scan_control *sc) { return true; } + +static bool sane_reclaim(struct scan_control *sc) +{ + return true; +} #endif static unsigned long zone_reclaimable_pages(struct zone *zone) @@ -452,14 +483,13 @@ static inline int is_page_cache_freeable(struct page *page) return page_count(page) - page_has_private(page) == 2; } -static int may_write_to_queue(struct backing_dev_info *bdi, - struct scan_control *sc) +static int may_write_to_inode(struct inode *inode, struct scan_control *sc) { if (current->flags & PF_SWAPWRITE) return 1; - if (!bdi_write_congested(bdi)) + if (!inode_write_congested(inode)) return 1; - if (bdi == current->backing_dev_info) + if (inode_to_bdi(inode) == current->backing_dev_info) return 1; return 0; } @@ -538,7 +568,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, } if (mapping->a_ops->writepage == NULL) return PAGE_ACTIVATE; - if (!may_write_to_queue(inode_to_bdi(mapping->host), sc)) + if (!may_write_to_inode(mapping->host, sc)) return PAGE_KEEP; if (clear_page_dirty_for_io(page)) { @@ -579,10 +609,14 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, static int __remove_mapping(struct address_space *mapping, struct page *page, bool reclaimed) { + unsigned long flags; + struct mem_cgroup *memcg; + BUG_ON(!PageLocked(page)); BUG_ON(mapping != page_mapping(page)); - spin_lock_irq(&mapping->tree_lock); + memcg = mem_cgroup_begin_page_stat(page); + spin_lock_irqsave(&mapping->tree_lock, flags); /* * The non racy check for a busy page. * @@ -620,7 +654,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, swp_entry_t swap = { .val = page_private(page) }; mem_cgroup_swapout(page, swap); __delete_from_swap_cache(page); - spin_unlock_irq(&mapping->tree_lock); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); swapcache_free(swap); } else { void (*freepage)(struct page *); @@ -640,8 +675,9 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, if (reclaimed && page_is_file_cache(page) && !mapping_exiting(mapping)) shadow = workingset_eviction(mapping, page); - __delete_from_page_cache(page, shadow); - spin_unlock_irq(&mapping->tree_lock); + __delete_from_page_cache(page, shadow, memcg); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); if (freepage != NULL) freepage(page); @@ -650,7 +686,8 @@ static int __remove_mapping(struct address_space *mapping, struct page *page, return 1; cannot_free: - spin_unlock_irq(&mapping->tree_lock); + spin_unlock_irqrestore(&mapping->tree_lock, flags); + mem_cgroup_end_page_stat(memcg); return 0; } @@ -917,7 +954,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, */ mapping = page_mapping(page); if (((dirty || writeback) && mapping && - bdi_write_congested(inode_to_bdi(mapping->host))) || + inode_write_congested(mapping->host)) || (writeback && PageReclaim(page))) nr_congested++; @@ -935,11 +972,11 @@ static unsigned long shrink_page_list(struct list_head *page_list, * note that the LRU is being scanned too quickly and the * caller can stall after page list has been processed. * - * 2) Global reclaim encounters a page, memcg encounters a - * page that is not marked for immediate reclaim or - * the caller does not have __GFP_FS (or __GFP_IO if it's - * simply going to swap, not to fs). In this case mark - * the page for immediate reclaim and continue scanning. + * 2) Global or new memcg reclaim encounters a page that is + * not marked for immediate reclaim, or the caller does not + * have __GFP_FS (or __GFP_IO if it's simply going to swap, + * not to fs). In this case mark the page for immediate + * reclaim and continue scanning. * * Require may_enter_fs because we would wait on fs, which * may not have submitted IO yet. And the loop driver might @@ -948,7 +985,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, * __GFP_IO|__GFP_FS for this reason); but more thought * would probably show more reasons. * - * 3) memcg encounters a page that is not already marked + * 3) Legacy memcg encounters a page that is not already marked * PageReclaim. memcg does not have any dirty pages * throttling so we could easily OOM just because too many * pages are in writeback and there is nothing else to @@ -963,7 +1000,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, goto keep_locked; /* Case 2 above */ - } else if (global_reclaim(sc) || + } else if (sane_reclaim(sc) || !PageReclaim(page) || !may_enter_fs) { /* * This is slightly racy - end_page_writeback() @@ -1412,7 +1449,7 @@ static int too_many_isolated(struct zone *zone, int file, if (current_is_kswapd()) return 0; - if (!global_reclaim(sc)) + if (!sane_reclaim(sc)) return 0; if (file) { @@ -1604,10 +1641,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec, set_bit(ZONE_WRITEBACK, &zone->flags); /* - * memcg will stall in page writeback so only consider forcibly - * stalling for global reclaim + * Legacy memcg will stall in page writeback so avoid forcibly + * stalling here. */ - if (global_reclaim(sc)) { + if (sane_reclaim(sc)) { /* * Tag a zone as congested if all the dirty pages scanned were * backed by a congested BDI and wait_iff_congested will stall. |