aboutsummaryrefslogtreecommitdiff
path: root/mm/memcontrol.c
AgeCommit message (Collapse)Author
2014-02-13memcg: fix css reference leak and endless loop in mem_cgroup_iterMichal Hocko
commit 0eef615665ede1e0d603ea9ecca88c1da6f02234 upstream. Commit 19f39402864e ("memcg: simplify mem_cgroup_iter") has reorganized mem_cgroup_iter code in order to simplify it. A part of that change was dropping an optimization which didn't call css_tryget on the root of the walked tree. The patch however didn't change the css_put part in mem_cgroup_iter which excludes root. This wasn't an issue at the time because __mem_cgroup_iter_next bailed out for root early without taking a reference as cgroup iterators (css_next_descendant_pre) didn't visit root themselves. Nevertheless cgroup iterators have been reworked to visit root by commit bd8815a6d802 ("cgroup: make css_for_each_descendant() and friends include the origin css in the iteration") when the root bypass have been dropped in __mem_cgroup_iter_next. This means that css_put is not called for root and so css along with mem_cgroup and other cgroup internal object tied by css lifetime are never freed. Fix the issue by reintroducing root check in __mem_cgroup_iter_next and do not take css reference for it. This reference counting magic protects us also from another issue, an endless loop reported by Hugh Dickins when reclaim races with root removal and css_tryget called by iterator internally would fail. There would be no other nodes to visit so __mem_cgroup_iter_next would return NULL and mem_cgroup_iter would interpret it as "start looping from root again" and so mem_cgroup_iter would loop forever internally. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13memcg: fix endless loop caused by mem_cgroup_iterMichal Hocko
commit ecc736fc3c71c411a9d201d8588c9e7e049e5d8c upstream. Hugh has reported an endless loop when the hardlimit reclaim sees the same group all the time. This might happen when the reclaim races with the memcg removal. shrink_zone [rmdir root] mem_cgroup_iter(root, NULL, reclaim) // prev = NULL rcu_read_lock() mem_cgroup_iter_load last_visited = iter->last_visited // gets root || NULL css_tryget(last_visited) // failed last_visited = NULL [1] memcg = root = __mem_cgroup_iter_next(root, NULL) mem_cgroup_iter_update iter->last_visited = root; reclaim->generation = iter->generation mem_cgroup_iter(root, root, reclaim) // prev = root rcu_read_lock mem_cgroup_iter_load last_visited = iter->last_visited // gets root css_tryget(last_visited) // failed [1] The issue seemed to be introduced by commit 5f5781619718 ("memcg: relax memcg iter caching") which has replaced unconditional css_get/css_put by css_tryget/css_put for the cached iterator. This patch fixes the issue by skipping css_tryget on the root of the tree walk in mem_cgroup_iter_load and symmetrically doesn't release it in mem_cgroup_iter_update. Signed-off-by: Michal Hocko <mhocko@suse.cz> Reported-by: Hugh Dickins <hughd@google.com> Tested-by: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-02-13mm/memcg: iteration skip memcgs not yet fully initializedHugh Dickins
commit d8ad30559715ce97afb7d1a93a12fd90e8fff312 upstream. It is surprising that the mem_cgroup iterator can return memcgs which have not yet been fully initialized. By accident (or trial and error?) this appears not to present an actual problem; but it may be better to prevent such surprises, by skipping memcgs not yet online. Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Tejun Heo <tj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-01-02memcg: fix memcg_size() calculationVladimir Davydov
The mem_cgroup structure contains nr_node_ids pointers to mem_cgroup_per_node objects, not the objects themselves. Signed-off-by: Vladimir Davydov <vdavydov@parallels.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@openvz.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12mm: memcg: do not allow task about to OOM kill to bypass the limitJohannes Weiner
Commit 4942642080ea ("mm: memcg: handle non-error OOM situations more gracefully") allowed tasks that already entered a memcg OOM condition to bypass the memcg limit on subsequent allocation attempts hoping this would expedite finishing the page fault and executing the kill. David Rientjes is worried that this breaks memcg isolation guarantees and since there is no evidence that the bypass actually speeds up fault processing just change it so that these subsequent charge attempts fail outright. The notable exception being __GFP_NOFAIL charges which are required to bypass the limit regardless. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-bt: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12mm: memcg: fix race condition between memcg teardown and swapinJohannes Weiner
There is a race condition between a memcg being torn down and a swapin triggered from a different memcg of a page that was recorded to belong to the exiting memcg on swapout (with CONFIG_MEMCG_SWAP extension). The result is unreclaimable pages pointing to dead memcgs, which can lead to anything from endless loops in later memcg teardown (the page is charged to all hierarchical parents but is not on any LRU list) or crashes from following the dangling memcg pointer. Memcgs with tasks in them can not be torn down and usually charges don't show up in memcgs without tasks. Swapin with the CONFIG_MEMCG_SWAP extension is the notable exception because it charges the cgroup that was recorded as owner during swapout, which may be empty and in the process of being torn down when a task in another memcg triggers the swapin: teardown: swapin: lookup_swap_cgroup_id() rcu_read_lock() mem_cgroup_lookup() css_tryget() rcu_read_unlock() disable css_tryget() call_rcu() offline_css() reparent_charges() res_counter_charge() (hierarchical!) css_put() css_free() pc->mem_cgroup = dead memcg add page to dead lru Add a final reparenting step into css_free() to make sure any such raced charges are moved out of the memcg before it's finally freed. In the longer term it would be cleaner to have the css_tryget() and the res_counter charge under the same RCU lock section so that the charge reparenting is deferred until the last charge whose tryget succeeded is visible. But this will require more invasive changes that will be harder to evaluate and backport into stable, so better defer them to a separate change set. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-12-12mm: memcg: do not declare OOM from __GFP_NOFAIL allocationsJohannes Weiner
Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the allocator") started recognizing __GFP_NOFAIL in memory cgroups but forgot to disable the OOM killer. Any task that does not fail allocation will also not enter the OOM completion path. So don't declare an OOM state in this case or it'll be leaked and the task be able to bypass the limit until the next userspace-triggered page fault cleans up the OOM state. Reported-by: William Dauchy <wdauchy@gmail.com> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: <stable@vger.kernel.org> [3.12.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-15mm, thp: change pmd_trans_huge_lock() to return taken lockKirill A. Shutemov
With split ptlock it's important to know which lock pmd_trans_huge_lock() took. This patch adds one more parameter to the function to return the lock. In most places migration to new api is trivial. Exception is move_huge_pmd(): we need to take two locks if pmd tables are different. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Tested-by: Alex Thorlton <athorlton@sgi.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "Eric W . Biederman" <ebiederm@xmission.com> Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Howells <dhowells@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Robin Holt <robinmholt@gmail.com> Cc: Sedat Dilek <sedat.dilek@gmail.com> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: 1) The addition of nftables. No longer will we need protocol aware firewall filtering modules, it can all live in userspace. At the core of nftables is a, for lack of a better term, virtual machine that executes byte codes to inspect packet or metadata (arriving interface index, etc.) and make verdict decisions. Besides support for loading packet contents and comparing them, the interpreter supports lookups in various datastructures as fundamental operations. For example sets are supports, and therefore one could create a set of whitelist IP address entries which have ACCEPT verdicts attached to them, and use the appropriate byte codes to do such lookups. Since the interpreted code is composed in userspace, userspace can do things like optimize things before giving it to the kernel. Another major improvement is the capability of atomically updating portions of the ruleset. In the existing netfilter implementation, one has to update the entire rule set in order to make a change and this is very expensive. Userspace tools exist to create nftables rules using existing netfilter rule sets, but both kernel implementations will need to co-exist for quite some time as we transition from the old to the new stuff. Kudos to Patrick McHardy, Pablo Neira Ayuso, and others who have worked so hard on this. 2) Daniel Borkmann and Hannes Frederic Sowa made several improvements to our pseudo-random number generator, mostly used for things like UDP port randomization and netfitler, amongst other things. In particular the taus88 generater is updated to taus113, and test cases are added. 3) Support 64-bit rates in HTB and TBF schedulers, from Eric Dumazet and Yang Yingliang. 4) Add support for new 577xx tigon3 chips to tg3 driver, from Nithin Sujir. 5) Fix two fatal flaws in TCP dynamic right sizing, from Eric Dumazet, Neal Cardwell, and Yuchung Cheng. 6) Allow IP_TOS and IP_TTL to be specified in sendmsg() ancillary control message data, much like other socket option attributes. From Francesco Fusco. 7) Allow applications to specify a cap on the rate computed automatically by the kernel for pacing flows, via a new SO_MAX_PACING_RATE socket option. From Eric Dumazet. 8) Make the initial autotuned send buffer sizing in TCP more closely reflect actual needs, from Eric Dumazet. 9) Currently early socket demux only happens for TCP sockets, but we can do it for connected UDP sockets too. Implementation from Shawn Bohrer. 10) Refactor inet socket demux with the goal of improving hash demux performance for listening sockets. With the main goals being able to use RCU lookups on even request sockets, and eliminating the listening lock contention. From Eric Dumazet. 11) The bonding layer has many demuxes in it's fast path, and an RCU conversion was started back in 3.11, several changes here extend the RCU usage to even more locations. From Ding Tianhong and Wang Yufen, based upon suggestions by Nikolay Aleksandrov and Veaceslav Falico. 12) Allow stackability of segmentation offloads to, in particular, allow segmentation offloading over tunnels. From Eric Dumazet. 13) Significantly improve the handling of secret keys we input into the various hash functions in the inet hashtables, TCP fast open, as well as syncookies. From Hannes Frederic Sowa. The key fundamental operation is "net_get_random_once()" which uses static keys. Hannes even extended this to ipv4/ipv6 fragmentation handling and our generic flow dissector. 14) The generic driver layer takes care now to set the driver data to NULL on device removal, so it's no longer necessary for drivers to explicitly set it to NULL any more. Many drivers have been cleaned up in this way, from Jingoo Han. 15) Add a BPF based packet scheduler classifier, from Daniel Borkmann. 16) Improve CRC32 interfaces and generic SKB checksum iterators so that SCTP's checksumming can more cleanly be handled. Also from Daniel Borkmann. 17) Add a new PMTU discovery mode, IP_PMTUDISC_INTERFACE, which forces using the interface MTU value. This helps avoid PMTU attacks, particularly on DNS servers. From Hannes Frederic Sowa. 18) Use generic XPS for transmit queue steering rather than internal (re-)implementation in virtio-net. From Jason Wang. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1622 commits) random32: add test cases for taus113 implementation random32: upgrade taus88 generator to taus113 from errata paper random32: move rnd_state to linux/random.h random32: add prandom_reseed_late() and call when nonblocking pool becomes initialized random32: add periodic reseeding random32: fix off-by-one in seeding requirement PHY: Add RTL8201CP phy_driver to realtek xtsonic: add missing platform_set_drvdata() in xtsonic_probe() macmace: add missing platform_set_drvdata() in mace_probe() ethernet/arc/arc_emac: add missing platform_set_drvdata() in arc_emac_probe() ipv6: protect for_each_sk_fl_rcu in mem_check with rcu_read_lock_bh vlan: Implement vlan_dev_get_egress_qos_mask as an inline. ixgbe: add warning when max_vfs is out of range. igb: Update link modes display in ethtool netfilter: push reasm skb through instead of original frag skbs ip6_output: fragment outgoing reassembled skb properly MAINTAINERS: mv643xx_eth: take over maintainership from Lennart net_sched: tbf: support of 64bit rates ixgbe: deleting dfwd stations out of order can cause null ptr deref ixgbe: fix build err, num_rx_queues is only available with CONFIG_RPS ...
2013-11-13Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds
Merge first patch-bomb from Andrew Morton: "Quite a lot of other stuff is banked up awaiting further next->mainline merging, but this batch contains: - Lots of random misc patches - OCFS2 - Most of MM - backlight updates - lib/ updates - printk updates - checkpatch updates - epoll tweaking - rtc updates - hfs - hfsplus - documentation - procfs - update gcov to gcc-4.7 format - IPC" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (269 commits) ipc, msg: fix message length check for negative values ipc/util.c: remove unnecessary work pending test devpts: plug the memory leak in kill_sb ./Makefile: export initial ramdisk compression config option init/Kconfig: add option to disable kernel compression drivers: w1: make w1_slave::flags long to avoid memory corruption drivers/w1/masters/ds1wm.cuse dev_get_platdata() drivers/memstick/core/ms_block.c: fix unreachable state in h_msb_read_page() drivers/memstick/core/mspro_block.c: fix attributes array allocation drivers/pps/clients/pps-gpio.c: remove redundant of_match_ptr kernel/panic.c: reduce 1 byte usage for print tainted buffer gcov: reuse kbasename helper kernel/gcov/fs.c: use pr_warn() kernel/module.c: use pr_foo() gcov: compile specific gcov implementation based on gcc version gcov: add support for gcc 4.7 gcov format gcov: move gcov structs definitions to a gcc version specific file kernel/taskstats.c: return -ENOMEM when alloc memory fails in add_del_listener() kernel/taskstats.c: add nla_nest_cancel() for failure processing between nla_nest_start() and nla_nest_end() kernel/sysctl_binary.c: use scnprintf() instead of snprintf() ...
2013-11-13Merge branch 'for-3.13' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Not too much activity this time around. css_id is finally killed and a minor update to device_cgroup" * 'for-3.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: device_cgroup: remove can_attach cgroup: kill css_id memcg: stop using css id memcg: fail to create cgroup if the cgroup id is too big memcg: convert to use cgroup id memcg: convert to use cgroup_is_descendant()
2013-11-13memcg, kmem: use cache_from_memcg_idx instead of hard codeQiang Huang
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13memcg, kmem: use is_root_cache instead of hard codeQiang Huang
Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Reviewed-by: Pekka Enberg <penberg@kernel.org> Acked-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13memcg: support hierarchical memory.numa_statsYing Han
The memory.numa_stat file was not hierarchical. Memory charged to the children was not shown in parent's numa_stat. This change adds the "hierarchical_" stats to the existing stats. The new hierarchical stats include the sum of all children's values in addition to the value of the memcg. Tested: Create cgroup a, a/b and run workload under b. The values of b are included in the "hierarchical_*" under a. $ cd /sys/fs/cgroup $ echo 1 > memory.use_hierarchy $ mkdir a a/b Run workload in a/b: $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) & The hierarchical_ fields in parent (a) show use of workload in a/b: $ cat a/memory.numa_stat total=0 N0=0 N1=0 N2=0 N3=0 file=0 N0=0 N1=0 N2=0 N3=0 anon=0 N0=0 N1=0 N2=0 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=908 N0=552 N1=317 N2=39 N3=0 hierarchical_file=850 N0=549 N1=301 N2=0 N3=0 hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 $ cat a/b/memory.numa_stat total=908 N0=552 N1=317 N2=39 N3=0 file=850 N0=549 N1=301 N2=0 N3=0 anon=58 N0=3 N1=16 N2=39 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=908 N0=552 N1=317 N2=39 N3=0 hierarchical_file=850 N0=549 N1=301 N2=0 N3=0 hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 Signed-off-by: Ying Han <yinghan@google.com> Signed-off-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13memcg: refactor mem_control_numa_stat_show()Greg Thelen
Refactor mem_control_numa_stat_show() to use a new stats structure for smaller and simpler code. This consolidates nearly identical code. text data bss dec hex filename 8,137,679 1,703,496 1,896,448 11,737,623 b31a17 vmlinux.before 8,136,911 1,703,496 1,896,448 11,736,855 b31717 vmlinux.after Signed-off-by: Greg Thelen <gthelen@google.com> Signed-off-by: Ying Han <yinghan@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-13mm: add a helper function to check may oom conditionQiang Huang
Use helper function to check if we need to deal with oom condition. Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-11-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/emulex/benet/be.h drivers/net/netconsole.c net/bridge/br_private.h Three mostly trivial conflicts. The net/bridge/br_private.h conflict was a function signature (argument addition) change overlapping with the extern removals from Joe Perches. In drivers/net/netconsole.c we had one change adjusting a printk message whilst another changed "printk(KERN_INFO" into "pr_info(". Lastly, the emulex change was a new inline function addition overlapping with Joe Perches's extern removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-11-01memcg: remove incorrect underflow checkGreg Thelen
When a memcg is deleted mem_cgroup_reparent_charges() moves charged memory to the parent memcg. As of v3.11-9444-g3ea67d0 "memcg: add per cgroup writeback pages accounting" there's bad pointer read. The goal was to check for counter underflow. The counter is a per cpu counter and there are two problems with the code: (1) per cpu access function isn't used, instead a naked pointer is used which easily causes oops. (2) the check doesn't sum all cpus Test: $ cd /sys/fs/cgroup/memory $ mkdir x $ echo 3 > /proc/sys/vm/drop_caches $ (echo $BASHPID >> x/tasks && exec cat) & [1] 7154 $ grep ^mapped x/memory.stat mapped_file 53248 $ echo 7154 > tasks $ rmdir x <OOPS> The fix is to remove the check. It's currently dangerous and isn't worth fixing it to use something expensive, such as percpu_counter_sum(), for each reparented page. __this_cpu_read() isn't enough to fix this because there's no guarantees of the current cpus count. The only guarantees is that the sum of all per-cpu counter is >= nr_pages. Fixes: 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting") Reported-and-tested-by: Flavio Leitner <fbl@redhat.com> Signed-off-by: Greg Thelen <gthelen@google.com> Reviewed-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-31mm: memcg: fix test for child groupsJohannes Weiner
When memcg code needs to know whether any given memcg has children, it uses the cgroup child iteration primitives and returns true/false depending on whether the iteration loop is executed at least once or not. Because a cgroup's list of children is RCU protected, these primitives require the RCU read-lock to be held, which is not the case for all memcg callers. This results in the following splat when e.g. enabling hierarchy mode: WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160() CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18 Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012 Call Trace: dump_stack+0x54/0x74 warn_slowpath_common+0x78/0xa0 warn_slowpath_null+0x1a/0x20 css_next_child+0xa3/0x160 mem_cgroup_hierarchy_write+0x5b/0xa0 cgroup_file_write+0x108/0x2a0 vfs_write+0xbd/0x1e0 SyS_write+0x4c/0xa0 system_call_fastpath+0x16/0x1b In the memcg case, we only care about children when we are attempting to modify inheritable attributes interactively. Racing with deletion could mean a spurious -EBUSY, no problem. Racing with addition is handled just fine as well through the memcg_create_mutex: if the child group is not on the list after the mutex is acquired, it won't be initialized from the parent's attributes until after the unlock. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-31mm: memcg: lockdep annotation for memcg OOM lockJohannes Weiner
The memcg OOM lock is a mutex-type lock that is open-coded due to memcg's special needs. Add annotations for lockdep coverage. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-31mm: memcg: use proper memcg in limit bypassJohannes Weiner
Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the allocator") allowed __GFP_NOFAIL allocations to bypass the limit if they fail to reclaim enough memory for the charge. But because the main test case was on a 3.2-based system, the patch missed the fact that on newer kernels the charge function needs to return root_mem_cgroup when bypassing the limit, and not NULL. This will corrupt whatever memory is at NULL + percpu pointer offset. Fix this quickly before problems are reported. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-30memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend castingGreg Thelen
As of commit 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting") memcg counter errors are possible when moving charged memory to a different memcg. Charge movement occurs when processing writes to memory.force_empty, moving tasks to a memcg with memcg.move_charge_at_immigrate=1, or memcg deletion. An example showing error after memory.force_empty: $ cd /sys/fs/cgroup/memory $ mkdir x $ rm /data/tmp/file $ (echo $BASHPID >> x/tasks && exec mmap_writer /data/tmp/file 1M) & [1] 13600 $ grep ^mapped x/memory.stat mapped_file 1048576 $ echo 13600 > tasks $ echo 1 > x/memory.force_empty $ grep ^mapped x/memory.stat mapped_file 4503599627370496 mapped_file should end with 0. 4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages 1048576 == 0x10,0000 == 0x100 pages This issue only affects the source memcg on 64 bit machines; the destination memcg counters are correct. So the rmdir case is not too important because such counters are soon disappearing with the entire memcg. But the memcg.force_empty and memory.move_charge_at_immigrate=1 cases are larger problems as the bogus counters are visible for the (possibly long) remaining life of the source memcg. The problem is due to memcg use of __this_cpu_from(.., -nr_pages), which is subtly wrong because it subtracts the unsigned int nr_pages (either -1 or -512 for THP) from a signed long percpu counter. When nr_pages=-1, -nr_pages=0xffffffff. On 64 bit machines stat->count[idx] is signed 64 bit. So memcg's attempt to simply decrement a count (e.g. from 1 to 0) boils down to: long count = 1 unsigned int nr_pages = 1 count += -nr_pages /* -nr_pages == 0xffff,ffff */ count is now 0x1,0000,0000 instead of 0 The fix is to subtract the unsigned page count rather than adding its negation. This only works once "percpu: fix this_cpu_sub() subtrahend casting for unsigneds" is applied to fix this_cpu_sub(). Signed-off-by: Greg Thelen <gthelen@google.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/usb/qmi_wwan.c include/net/dst.h Trivial merge conflicts, both were overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-21tcp_memcontrol: Kill struct tcp_memcontrolEric W. Biederman
Replace the pointers in struct cg_proto with actual data fields and kill struct tcp_memcontrol as it is not fully redundant. This removes a confusing, unnecessary layer of abstraction. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-10-16fs: buffer: move allocation failure loop into the allocatorJohannes Weiner
Buffer allocation has a very crude indefinite loop around waking the flusher threads and performing global NOFS direct reclaim because it can not handle allocation failures. The most immediate problem with this is that the allocation may fail due to a memory cgroup limit, where flushers + direct reclaim might not make any progress towards resolving the situation at all. Because unlike the global case, a memory cgroup may not have any cache at all, only anonymous pages but no swap. This situation will lead to a reclaim livelock with insane IO from waking the flushers and thrashing unrelated filesystem cache in a tight loop. Use __GFP_NOFAIL allocations for buffers for now. This makes sure that any looping happens in the page allocator, which knows how to orchestrate kswapd, direct reclaim, and the flushers sensibly. It also allows memory cgroups to detect allocations that can't handle failure and will allow them to ultimately bypass the limit if reclaim can not make progress. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-16mm: memcg: handle non-error OOM situations more gracefullyJohannes Weiner
Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM") assumed that only a few places that can trigger a memcg OOM situation do not return VM_FAULT_OOM, like optional page cache readahead. But there are many more and it's impractical to annotate them all. First of all, we don't want to invoke the OOM killer when the failed allocation is gracefully handled, so defer the actual kill to the end of the fault handling as well. This simplifies the code quite a bit for added bonus. Second, since a failed allocation might not be the abrupt end of the fault, the memcg OOM handler needs to be re-entrant until the fault finishes for subsequent allocation attempts. If an allocation is attempted after the task already OOMed, allow it to bypass the limit so that it can quickly finish the fault and invoke the OOM killer. Reported-by: azurIt <azurit@pobox.sk> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-10-16mm, memcg: protect mem_cgroup_read_events for cpu hotplugDavid Rientjes
for_each_online_cpu() needs the protection of {get,put}_online_cpus() so cpu_online_mask doesn't change during the iteration. cpu_hotplug.lock is held while a cpu is going down, it's a coarse lock that is used kernel-wide to synchronize cpu hotplug activity. Memcg has a cpu hotplug notifier, called while there may not be any cpu hotplug refcounts, which drains per-cpu event counts to memcg->nocpu_base.events to maintain a cumulative event count as cpus disappear. Without get_online_cpus() in mem_cgroup_read_events(), it's possible to account for the event count on a dying cpu twice, and this value may be significantly large. In fact, all memcg->pcp_counter_lock use should be nested by {get,put}_online_cpus(). This fixes that issue and ensures the reported statistics are not vastly over-reported during cpu hotplug. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24revert "memcg, vmscan: integrate soft reclaim tighter with zone shrinking code"Andrew Morton
Revert commit 3b38722efd9f ("memcg, vmscan: integrate soft reclaim tighter with zone shrinking code") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24revert "memcg: get rid of soft-limit tree infrastructure"Andrew Morton
Revert commit e883110aad71 ("memcg: get rid of soft-limit tree infrastructure") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24revert "vmscan, memcg: do softlimit reclaim also for targeted reclaim"Andrew Morton
Revert commit a5b7c87f9207 ("vmscan, memcg: do softlimit reclaim also for targeted reclaim") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24revert "memcg: enhance memcg iterator to support predicates"Andrew Morton
Revert commit de57780dc659 ("memcg: enhance memcg iterator to support predicates") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24revert "memcg: track children in soft limit excess to improve soft limit"Andrew Morton
Revert commit 7d910c054be4 ("memcg: track children in soft limit excess to improve soft limit") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24revert "memcg, vmscan: do not attempt soft limit reclaim if it would not ↵Andrew Morton
scan anything" Revert commit e839b6a1c8d0 ("memcg, vmscan: do not attempt soft limit reclaim if it would not scan anything") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-24revert "memcg: track all children over limit in the root"Andrew Morton
Revert commit 1be171d60bdd ("memcg: track all children over limit in the root") I merged this prematurely - Michal and Johannes still disagree about the overall design direction and the future remains unclear. Cc: Michal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-23memcg: stop using css idLi Zefan
Now memcg uses cgroup id instead of css id. Update some comments and set mem_cgroup_subsys->use_id to 0. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-09-23memcg: fail to create cgroup if the cgroup id is too bigLi Zefan
memcg requires the cgroup id to be smaller than 65536. This is a preparation to kill css id. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-09-23memcg: convert to use cgroup idLi Zefan
Use cgroup id instead of css id. This is a preparation to kill css id. Note, as memcg treat 0 as an invalid id, while cgroup id starts with 0, we define memcg_id == cgroup_id + 1. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-09-23memcg: convert to use cgroup_is_descendant()Li Zefan
This is a preparation to kill css_id. Signed-off-by: Li Zefan <lizefan@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Tejun Heo <tj@kernel.org>
2013-09-12memcg: add per cgroup writeback pages accountingSha Zhengju
Add memcg routines to count writeback pages, later dirty pages will also be accounted. After Kame's commit 89c06bd52fb9 ("memcg: use new logic for page stat accounting"), we can use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg has a feature to move a page from a cgroup to another one and may have race between "move" and "page stat accounting". So in order to avoid the race we have designed a new lock: mem_cgroup_begin_update_page_stat() modify page information -->(a) mem_cgroup_update_page_stat() -->(b) mem_cgroup_end_update_page_stat() It requires both (a) and (b)(writeback pages accounting) to be pretected in mem_cgroup_{begin/end}_update_page_stat(). It's full no-op for !CONFIG_MEMCG, almost no-op if memcg is disabled (but compiled in), rcu read lock in the most cases (no task is moving), and spin_lock_irqsave on top in the slow path. There're two writeback interfaces to modify: test_{clear/set}_page_writeback(). And the lock order is: --> memcg->move_lock --> mapping->tree_lock Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: check for proper lock held in mem_cgroup_update_page_statSha Zhengju
We should call mem_cgroup_begin_update_page_stat() before mem_cgroup_update_page_stat() to get proper locks, however the latter doesn't do any checking that we use proper locking, which would be hard. Suggested by Michal Hock we could at least test for rcu_read_lock_held() because RCU is held if !mem_cgroup_disabled(). Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: remove MEMCG_NR_FILE_MAPPEDSha Zhengju
While accounting memcg page stat, it's not worth to use MEMCG_NR_FILE_MAPPED as an extra layer of indirection because of the complexity and presumed performance overhead. We can use MEM_CGROUP_STAT_FILE_MAPPED directly. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Fengguang Wu <fengguang.wu@intel.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: rename RESOURCE_MAX to RES_COUNTER_MAXSha Zhengju
RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12mm: memcg: do not trap chargers with full callstack on OOMJohannes Weiner
The memcg OOM handling is incredibly fragile and can deadlock. When a task fails to charge memory, it invokes the OOM killer and loops right there in the charge code until it succeeds. Comparably, any other task that enters the charge path at this point will go to a waitqueue right then and there and sleep until the OOM situation is resolved. The problem is that these tasks may hold filesystem locks and the mmap_sem; locks that the selected OOM victim may need to exit. For example, in one reported case, the task invoking the OOM killer was about to charge a page cache page during a write(), which holds the i_mutex. The OOM killer selected a task that was just entering truncate() and trying to acquire the i_mutex: OOM invoking task: mem_cgroup_handle_oom+0x241/0x3b0 mem_cgroup_cache_charge+0xbe/0xe0 add_to_page_cache_locked+0x4c/0x140 add_to_page_cache_lru+0x22/0x50 grab_cache_page_write_begin+0x8b/0xe0 ext3_write_begin+0x88/0x270 generic_file_buffered_write+0x116/0x290 __generic_file_aio_write+0x27c/0x480 generic_file_aio_write+0x76/0xf0 # takes ->i_mutex do_sync_write+0xea/0x130 vfs_write+0xf3/0x1f0 sys_write+0x51/0x90 system_call_fastpath+0x18/0x1d OOM kill victim: do_truncate+0x58/0xa0 # takes i_mutex do_last+0x250/0xa30 path_openat+0xd7/0x440 do_filp_open+0x49/0xa0 do_sys_open+0x106/0x240 sys_open+0x20/0x30 system_call_fastpath+0x18/0x1d The OOM handling task will retry the charge indefinitely while the OOM killed task is not releasing any resources. A similar scenario can happen when the kernel OOM killer for a memcg is disabled and a userspace task is in charge of resolving OOM situations. In this case, ALL tasks that enter the OOM path will be made to sleep on the OOM waitqueue and wait for userspace to free resources or increase the group's limit. But a userspace OOM handler is prone to deadlock itself on the locks held by the waiting tasks. For example one of the sleeping tasks may be stuck in a brk() call with the mmap_sem held for writing but the userspace handler, in order to pick an optimal victim, may need to read files from /proc/<pid>, which tries to acquire the same mmap_sem for reading and deadlocks. This patch changes the way tasks behave after detecting a memcg OOM and makes sure nobody loops or sleeps with locks held: 1. When OOMing in a user fault, invoke the OOM killer and restart the fault instead of looping on the charge attempt. This way, the OOM victim can not get stuck on locks the looping task may hold. 2. When OOMing in a user fault but somebody else is handling it (either the kernel OOM killer or a userspace handler), don't go to sleep in the charge context. Instead, remember the OOMing memcg in the task struct and then fully unwind the page fault stack with -ENOMEM. pagefault_out_of_memory() will then call back into the memcg code to check if the -ENOMEM came from the memcg, and then either put the task to sleep on the memcg's OOM waitqueue or just restart the fault. The OOM victim can no longer get stuck on any lock a sleeping task may hold. Debugged by Michal Hocko. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: azurIt <azurit@pobox.sk> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12mm: memcg: rework and document OOM waiting and wakeupJohannes Weiner
The memcg OOM handler open-codes a sleeping lock for OOM serialization (trylock, wait, repeat) because the required locking is so specific to memcg hierarchies. However, it would be nice if this construct would be clearly recognizable and not be as obfuscated as it is right now. Clean up as follows: 1. Remove the return value of mem_cgroup_oom_unlock() 2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock(). 3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This makes it more obvious that the task has to be on the waitqueue before attempting to OOM-trylock the hierarchy, to not miss any wakeups before going to sleep. It just didn't matter until now because it was all lumped together into the global memcg_oom_lock spinlock section. 4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope. It is proctected by the hierarchical OOM-lock. 5. The memcg_oom_lock spinlock is only required to propagate the OOM lock in any given hierarchy atomically. Restrict its scope to mem_cgroup_oom_(trylock|unlock). 6. Do not wake up the waitqueue unconditionally at the end of the function. Only the lockholder has to wake up the next in line after releasing the lock. Note that the lockholder kicks off the OOM-killer, which in turn leads to wakeups from the uncharges of the exiting task. But a contender is not guaranteed to see them if it enters the OOM path after the OOM kills but before the lockholder releases the lock. Thus there has to be an explicit wakeup after releasing the lock. 7. Put the OOM task on the waitqueue before marking the hierarchy as under OOM as that is the point where we start to receive wakeups. No point in listening before being on the waitqueue. 8. Likewise, unmark the hierarchy before finishing the sleep, for symmetry. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12mm: memcg: enable memcg OOM killer only for user faultsJohannes Weiner
System calls and kernel faults (uaccess, gup) can handle an out of memory situation gracefully and just return -ENOMEM. Enable the memcg OOM killer only for user faults, where it's really the only option available. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: azurIt <azurit@pobox.sk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: trivial cleanupsAndrew Morton
Clean up some mess made by the "Soft limit rework" series, and a few other things. Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: track all children over limit in the rootMichal Hocko
Children in soft limit excess are currently tracked up the hierarchy in memcg->children_in_excess. Nevertheless there still might exist tons of groups that are not in hierarchy relation to the root cgroup (e.g. all first level groups if root_mem_cgroup->use_hierarchy == false). As the whole tree walk has to be done when the iteration starts at root_mem_cgroup the iterator should be able to skip the walk if there is no child above the limit without iterating them. This can be done easily if the root tracks all children rather than only hierarchical children. This is done by this patch which updates root_mem_cgroup children_in_excess if root_mem_cgroup->use_hierarchy == false so the root knows about all children in excess. Please note that this is not an issue for inner memcgs which have use_hierarchy == false because then only the single group is visited so no special optimization is necessary. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg, vmscan: do not attempt soft limit reclaim if it would not scan anythingMichal Hocko
mem_cgroup_should_soft_reclaim controls whether soft reclaim pass is done and it always says yes currently. Memcg iterators are clever to skip nodes that are not soft reclaimable quite efficiently but mem_cgroup_should_soft_reclaim can be more clever and do not start the soft reclaim pass at all if it knows that nothing would be scanned anyway. In order to do that, simply reuse mem_cgroup_soft_reclaim_eligible for the target group of the reclaim and allow the pass only if the whole subtree wouldn't be skipped. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: track children in soft limit excess to improve soft limitMichal Hocko
Soft limit reclaim has to check the whole reclaim hierarchy while doing the first pass of the reclaim. This leads to a higher system time which can be visible especially when there are many groups in the hierarchy. This patch adds a per-memcg counter of children in excess. It also restores MEM_CGROUP_TARGET_SOFTLIMIT into mem_cgroup_event_ratelimit for a proper batching. If a group crosses soft limit for the first time it increases parent's children_in_excess up the hierarchy. The similarly if a group gets below the limit it will decrease the counter. The transition phase is recorded in soft_contributed flag. mem_cgroup_soft_reclaim_eligible then uses this information to better decide whether to skip the node or the whole subtree. The rule is simple. Skip the node with a children in excess or skip the whole subtree otherwise. This has been tested by a stream IO (dd if=/dev/zero of=file with 4*MemTotal size) which is quite sensitive to overhead during reclaim. The load is running in a group with soft limit set to 0 and without any limit. Apart from that there was a hierarchy with ~500, 2k and 8k groups (two groups on each level) without any pages in them. base denotes to the kernel on which the whole series is based on, rework is the kernel before this patch and reworkoptim is with this patch applied: * Run with soft limit set to 0 Elapsed 0-0-limit/base: min: 88.21 max: 94.61 avg: 91.73 std: 2.65 runs: 3 0-0-limit/rework: min: 76.05 [86.2%] max: 79.08 [83.6%] avg: 77.84 [84.9%] std: 1.30 runs: 3 0-0-limit/reworkoptim: min: 77.98 [88.4%] max: 80.36 [84.9%] avg: 78.92 [86.0%] std: 1.03 runs: 3 System 0.5k-0-limit/base: min: 34.86 max: 36.42 avg: 35.89 std: 0.73 runs: 3 0.5k-0-limit/rework: min: 43.26 [124.1%] max: 48.95 [134.4%] avg: 46.09 [128.4%] std: 2.32 runs: 3 0.5k-0-limit/reworkoptim: min: 46.98 [134.8%] max: 50.98 [140.0%] avg: 48.49 [135.1%] std: 1.77 runs: 3 Elapsed 0.5k-0-limit/base: min: 88.50 max: 97.52 avg: 93.92 std: 3.90 runs: 3 0.5k-0-limit/rework: min: 75.92 [85.8%] max: 78.45 [80.4%] avg: 77.34 [82.3%] std: 1.06 runs: 3 0.5k-0-limit/reworkoptim: min: 75.79 [85.6%] max: 79.37 [81.4%] avg: 77.55 [82.6%] std: 1.46 runs: 3 System 2k-0-limit/base: min: 34.57 max: 37.65 avg: 36.34 std: 1.30 runs: 3 2k-0-limit/rework: min: 64.17 [185.6%] max: 68.20 [181.1%] avg: 66.21 [182.2%] std: 1.65 runs: 3 2k-0-limit/reworkoptim: min: 49.78 [144.0%] max: 52.99 [140.7%] avg: 51.00 [140.3%] std: 1.42 runs: 3 Elapsed 2k-0-limit/base: min: 92.61 max: 97.83 avg: 95.03 std: 2.15 runs: 3 2k-0-limit/rework: min: 78.33 [84.6%] max: 84.08 [85.9%] avg: 81.09 [85.3%] std: 2.35 runs: 3 2k-0-limit/reworkoptim: min: 75.72 [81.8%] max: 78.57 [80.3%] avg: 76.73 [80.7%] std: 1.30 runs: 3 System 8k-0-limit/base: min: 39.78 max: 42.09 avg: 41.09 std: 0.97 runs: 3 8k-0-limit/rework: min: 200.86 [504.9%] max: 265.42 [630.6%] avg: 241.80 [588.5%] std: 29.06 runs: 3 8k-0-limit/reworkoptim: min: 53.70 [135.0%] max: 54.89 [130.4%] avg: 54.43 [132.5%] std: 0.52 runs: 3 Elapsed 8k-0-limit/base: min: 95.11 max: 98.61 avg: 96.81 std: 1.43 runs: 3 8k-0-limit/rework: min: 246.96 [259.7%] max: 331.47 [336.1%] avg: 301.32 [311.2%] std: 38.52 runs: 3 8k-0-limit/reworkoptim: min: 76.79 [80.7%] max: 81.71 [82.9%] avg: 78.97 [81.6%] std: 2.05 runs: 3 System time is increased by 30-40% but it is reduced a lot comparing to kernel without this patch. The higher time can be explained by the fact that the original soft reclaim scanned at priority 0 so it was much more effective for this workload (which is basically touch once and writeback). The Elapsed time looks better though (~20%). * Run with no soft limit set System 0-no-limit/base: min: 42.18 max: 50.38 avg: 46.44 std: 3.36 runs: 3 0-no-limit/rework: min: 40.57 [96.2%] max: 47.04 [93.4%] avg: 43.82 [94.4%] std: 2.64 runs: 3 0-no-limit/reworkoptim: min: 40.45 [95.9%] max: 45.28 [89.9%] avg: 42.10 [90.7%] std: 2.25 runs: 3 Elapsed 0-no-limit/base: min: 75.97 max: 78.21 avg: 76.87 std: 0.96 runs: 3 0-no-limit/rework: min: 75.59 [99.5%] max: 80.73 [103.2%] avg: 77.64 [101.0%] std: 2.23 runs: 3 0-no-limit/reworkoptim: min: 77.85 [102.5%] max: 82.42 [105.4%] avg: 79.64 [103.6%] std: 1.99 runs: 3 System 0.5k-no-limit/base: min: 44.54 max: 46.93 avg: 46.12 std: 1.12 runs: 3 0.5k-no-limit/rework: min: 42.09 [94.5%] max: 46.16 [98.4%] avg: 43.92 [95.2%] std: 1.69 runs: 3 0.5k-no-limit/reworkoptim: min: 42.47 [95.4%] max: 45.67 [97.3%] avg: 44.06 [95.5%] std: 1.31 runs: 3 Elapsed 0.5k-no-limit/base: min: 78.26 max: 81.49 avg: 79.65 std: 1.36 runs: 3 0.5k-no-limit/rework: min: 77.01 [98.4%] max: 80.43 [98.7%] avg: 78.30 [98.3%] std: 1.52 runs: 3 0.5k-no-limit/reworkoptim: min: 76.13 [97.3%] max: 77.87 [95.6%] avg: 77.18 [96.9%] std: 0.75 runs: 3 System 2k-no-limit/base: min: 62.96 max: 69.14 avg: 66.14 std: 2.53 runs: 3 2k-no-limit/rework: min: 76.01 [120.7%] max: 81.06 [117.2%] avg: 78.17 [118.2%] std: 2.12 runs: 3 2k-no-limit/reworkoptim: min: 62.57 [99.4%] max: 66.10 [95.6%] avg: 64.53 [97.6%] std: 1.47 runs: 3 Elapsed 2k-no-limit/base: min: 76.47 max: 84.22 avg: 79.12 std: 3.60 runs: 3 2k-no-limit/rework: min: 89.67 [117.3%] max: 93.26 [110.7%] avg: 91.10 [115.1%] std: 1.55 runs: 3 2k-no-limit/reworkoptim: min: 76.94 [100.6%] max: 79.21 [94.1%] avg: 78.45 [99.2%] std: 1.07 runs: 3 System 8k-no-limit/base: min: 104.74 max: 151.34 avg: 129.21 std: 19.10 runs: 3 8k-no-limit/rework: min: 205.23 [195.9%] max: 285.94 [188.9%] avg: 258.98 [200.4%] std: 38.01 runs: 3 8k-no-limit/reworkoptim: min: 161.16 [153.9%] max: 184.54 [121.9%] avg: 174.52 [135.1%] std: 9.83 runs: 3 Elapsed 8k-no-limit/base: min: 125.43 max: 181.00 avg: 154.81 std: 22.80 runs: 3 8k-no-limit/rework: min: 254.05 [202.5%] max: 355.67 [196.5%] avg: 321.46 [207.6%] std: 47.67 runs: 3 8k-no-limit/reworkoptim: min: 193.77 [154.5%] max: 222.72 [123.0%] avg: 210.18 [135.8%] std: 12.13 runs: 3 Both System and Elapsed are in stdev with the base kernel for all configurations except for 8k where both System and Elapsed are up by 35%. I do not have a good explanation for this because there is no soft reclaim pass going on as no group is above the limit which is checked in mem_cgroup_should_soft_reclaim. Then I have tested kernel build with the same configuration to see the behavior with a more general behavior. * Soft limit set to 0 for the build System 0-0-limit/base: min: 242.70 max: 245.17 avg: 243.85 std: 1.02 runs: 3 0-0-limit/rework min: 237.86 [98.0%] max: 240.22 [98.0%] avg: 239.00 [98.0%] std: 0.97 runs: 3 0-0-limit/reworkoptim: min: 241.11 [99.3%] max: 243.53 [99.3%] avg: 242.01 [99.2%] std: 1.08 runs: 3 Elapsed 0-0-limit/base: min: 348.48 max: 360.86 avg: 356.04 std: 5.41 runs: 3 0-0-limit/rework min: 286.95 [82.3%] max: 290.26 [80.4%] avg: 288.27 [81.0%] std: 1.43 runs: 3 0-0-limit/reworkoptim: min: 286.55 [82.2%] max: 289.00 [80.1%] avg: 287.69 [80.8%] std: 1.01 runs: 3 System 0.5k-0-limit/base: min: 251.77 max: 254.41 avg: 252.70 std: 1.21 runs: 3 0.5k-0-limit/rework min: 286.44 [113.8%] max: 289.30 [113.7%] avg: 287.60 [113.8%] std: 1.23 runs: 3 0.5k-0-limit/reworkoptim: min: 252.18 [100.2%] max: 253.16 [99.5%] avg: 252.62 [100.0%] std: 0.41 runs: 3 Elapsed 0.5k-0-limit/base: min: 347.83 max: 353.06 avg: 350.04 std: 2.21 runs: 3 0.5k-0-limit/rework min: 290.19 [83.4%] max: 295.62 [83.7%] avg: 293.12 [83.7%] std: 2.24 runs: 3 0.5k-0-limit/reworkoptim: min: 293.91 [84.5%] max: 294.87 [83.5%] avg: 294.29 [84.1%] std: 0.42 runs: 3 System 2k-0-limit/base: min: 263.05 max: 271.52 avg: 267.94 std: 3.58 runs: 3 2k-0-limit/rework min: 458.99 [174.5%] max: 468.31 [172.5%] avg: 464.45 [173.3%] std: 3.97 runs: 3 2k-0-limit/reworkoptim: min: 267.10 [101.5%] max: 279.38 [102.9%] avg: 272.78 [101.8%] std: 5.05 runs: 3 Elapsed 2k-0-limit/base: min: 372.33 max: 379.32 avg: 375.47 std: 2.90 runs: 3 2k-0-limit/rework min: 334.40 [89.8%] max: 339.52 [89.5%] avg: 337.44 [89.9%] std: 2.20 runs: 3 2k-0-limit/reworkoptim: min: 301.47 [81.0%] max: 319.19 [84.1%] avg: 307.90 [82.0%] std: 8.01 runs: 3 System 8k-0-limit/base: min: 320.50 max: 332.10 avg: 325.46 std: 4.88 runs: 3 8k-0-limit/rework min: 1115.76 [348.1%] max: 1165.66 [351.0%] avg: 1132.65 [348.0%] std: 23.34 runs: 3 8k-0-limit/reworkoptim: min: 403.75 [126.0%] max: 409.22 [123.2%] avg: 406.16 [124.8%] std: 2.28 runs: 3 Elapsed 8k-0-limit/base: min: 475.48 max: 585.19 avg: 525.54 std: 45.30 runs: 3 8k-0-limit/rework min: 616.25 [129.6%] max: 625.90 [107.0%] avg: 620.68 [118.1%] std: 3.98 runs: 3 8k-0-limit/reworkoptim: min: 420.18 [88.4%] max: 428.28 [73.2%] avg: 423.05 [80.5%] std: 3.71 runs: 3 Apart from 8k the system time is comparable with the base kernel while Elapsed is up to 20% better with all configurations. * No soft limit set System 0-no-limit/base: min: 234.76 max: 237.42 avg: 236.25 std: 1.11 runs: 3 0-no-limit/rework min: 233.09 [99.3%] max: 238.65 [100.5%] avg: 236.09 [99.9%] std: 2.29 runs: 3 0-no-limit/reworkoptim: min: 236.12 [100.6%] max: 240.53 [101.3%] avg: 237.94 [100.7%] std: 1.88 runs: 3 Elapsed 0-no-limit/base: min: 288.52 max: 295.42 avg: 291.29 std: 2.98 runs: 3 0-no-limit/rework min: 283.17 [98.1%] max: 284.33 [96.2%] avg: 283.78 [97.4%] std: 0.48 runs: 3 0-no-limit/reworkoptim: min: 288.50 [100.0%] max: 290.79 [98.4%] avg: 289.78 [99.5%] std: 0.95 runs: 3 System 0.5k-no-limit/base: min: 286.51 max: 293.23 avg: 290.21 std: 2.78 runs: 3 0.5k-no-limit/rework min: 291.69 [101.8%] max: 294.38 [100.4%] avg: 292.97 [101.0%] std: 1.10 runs: 3 0.5k-no-limit/reworkoptim: min: 277.05 [96.7%] max: 288.76 [98.5%] avg: 284.17 [97.9%] std: 5.11 runs: 3 Elapsed 0.5k-no-limit/base: min: 294.94 max: 298.92 avg: 296.47 std: 1.75 runs: 3 0.5k-no-limit/rework min: 292.55 [99.2%] max: 294.21 [98.4%] avg: 293.55 [99.0%] std: 0.72 runs: 3 0.5k-no-limit/reworkoptim: min: 294.41 [99.8%] max: 301.67 [100.9%] avg: 297.78 [100.4%] std: 2.99 runs: 3 System 2k-no-limit/base: min: 443.41 max: 466.66 avg: 457.66 std: 10.19 runs: 3 2k-no-limit/rework min: 490.11 [110.5%] max: 516.02 [110.6%] avg: 501.42 [109.6%] std: 10.83 runs: 3 2k-no-limit/reworkoptim: min: 435.25 [98.2%] max: 458.11 [98.2%] avg: 446.73 [97.6%] std: 9.33 runs: 3 Elapsed 2k-no-limit/base: min: 330.85 max: 333.75 avg: 332.52 std: 1.23 runs: 3 2k-no-limit/rework min: 343.06 [103.7%] max: 349.59 [104.7%] avg: 345.95 [104.0%] std: 2.72 runs: 3 2k-no-limit/reworkoptim: min: 330.01 [99.7%] max: 333.92 [100.1%] avg: 332.22 [99.9%] std: 1.64 runs: 3 System 8k-no-limit/base: min: 1175.64 max: 1259.38 avg: 1222.39 std: 34.88 runs: 3 8k-no-limit/rework min: 1226.31 [104.3%] max: 1241.60 [98.6%] avg: 1233.74 [100.9%] std: 6.25 runs: 3 8k-no-limit/reworkoptim: min: 1023.45 [87.1%] max: 1056.74 [83.9%] avg: 1038.92 [85.0%] std: 13.69 runs: 3 Elapsed 8k-no-limit/base: min: 613.36 max: 619.60 avg: 616.47 std: 2.55 runs: 3 8k-no-limit/rework min: 627.56 [102.3%] max: 642.33 [103.7%] avg: 633.44 [102.8%] std: 6.39 runs: 3 8k-no-limit/reworkoptim: min: 545.89 [89.0%] max: 555.36 [89.6%] avg: 552.06 [89.6%] std: 4.37 runs: 3 and these numbers look good as well. System time is around 100% (suprisingly better for the 8k case) and Elapsed is copies that trend. Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: enhance memcg iterator to support predicatesMichal Hocko
The caller of the iterator might know that some nodes or even subtrees should be skipped but there is no way to tell iterators about that so the only choice left is to let iterators to visit each node and do the selection outside of the iterating code. This, however, doesn't scale well with hierarchies with many groups where only few groups are interesting. This patch adds mem_cgroup_iter_cond variant of the iterator with a callback which gets called for every visited node. There are three possible ways how the callback can influence the walk. Either the node is visited, it is skipped but the tree walk continues down the tree or the whole subtree of the current group is skipped. [hughd@google.com: fix memcg-less page reclaim] Signed-off-by: Michal Hocko <mhocko@suse.cz> Cc: Balbir Singh <bsingharora@gmail.com> Cc: Glauber Costa <glommer@openvz.org> Cc: Greg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michel Lespinasse <walken@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>