aboutsummaryrefslogtreecommitdiff
path: root/kernel
AgeCommit message (Collapse)Author
2013-10-25keys: change asymmetric keys to use common hash definitionsDmitry Kasatkin
This patch makes use of the newly defined common hash algorithm info, replacing, for example, PKEY_HASH with HASH_ALGO. Changelog: - Lindent fixes - Mimi CC: David Howells <dhowells@redhat.com> Signed-off-by: Dmitry Kasatkin <d.kasatkin@samsung.com> Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
2013-09-25KEYS: Make the system 'trusted' keyring viewable by userspaceMimi Zohar
Give the root user the ability to read the system keyring and put read permission on the trusted keys added during boot. The latter is actually more theoretical than real for the moment as asymmetric keys do not currently provide a read operation. Signed-off-by: Mimi Zohar <zohar@us.ibm.com> Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-25KEYS: Add a 'trusted' flag and a 'trusted only' flagDavid Howells
Add KEY_FLAG_TRUSTED to indicate that a key either comes from a trusted source or had a cryptographic signature chain that led back to a trusted key the kernel already possessed. Add KEY_FLAGS_TRUSTED_ONLY to indicate that a keyring will only accept links to keys marked with KEY_FLAGS_TRUSTED. Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org>
2013-09-25KEYS: Separate the kernel signature checking keyring from module signingDavid Howells
Separate the kernel signature checking keyring from module signing so that it can be used by code other than the module-signing code. Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-25KEYS: Have make canonicalise the paths of the X.509 certs better to deduplicateDavid Howells
Have make canonicalise the paths of the X.509 certificates before we sort them as this allows $(sort) to better remove duplicates. Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-25KEYS: Load *.x509 files into kernel keyringDavid Howells
Load all the files matching the pattern "*.x509" that are to be found in kernel base source dir and base build dir into the module signing keyring. The "extra_certificates" file is then redundant. Signed-off-by: David Howells <dhowells@redhat.com>
2013-09-25KEYS: Rename public key parameter name arraysDavid Howells
Rename the arrays of public key parameters (public key algorithm names, hash algorithm names and ID type names) so that the array name ends in "_name". Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Josh Boyer <jwboyer@redhat.com>
2013-09-24KEYS: Add per-user_namespace registers for persistent per-UID kerberos cachesDavid Howells
Add support for per-user_namespace registers of persistent per-UID kerberos caches held within the kernel. This allows the kerberos cache to be retained beyond the life of all a user's processes so that the user's cron jobs can work. The kerberos cache is envisioned as a keyring/key tree looking something like: struct user_namespace \___ .krb_cache keyring - The register \___ _krb.0 keyring - Root's Kerberos cache \___ _krb.5000 keyring - User 5000's Kerberos cache \___ _krb.5001 keyring - User 5001's Kerberos cache \___ tkt785 big_key - A ccache blob \___ tkt12345 big_key - Another ccache blob Or possibly: struct user_namespace \___ .krb_cache keyring - The register \___ _krb.0 keyring - Root's Kerberos cache \___ _krb.5000 keyring - User 5000's Kerberos cache \___ _krb.5001 keyring - User 5001's Kerberos cache \___ tkt785 keyring - A ccache \___ krbtgt/REDHAT.COM@REDHAT.COM big_key \___ http/REDHAT.COM@REDHAT.COM user \___ afs/REDHAT.COM@REDHAT.COM user \___ nfs/REDHAT.COM@REDHAT.COM user \___ krbtgt/KERNEL.ORG@KERNEL.ORG big_key \___ http/KERNEL.ORG@KERNEL.ORG big_key What goes into a particular Kerberos cache is entirely up to userspace. Kernel support is limited to giving you the Kerberos cache keyring that you want. The user asks for their Kerberos cache by: krb_cache = keyctl_get_krbcache(uid, dest_keyring); The uid is -1 or the user's own UID for the user's own cache or the uid of some other user's cache (requires CAP_SETUID). This permits rpc.gssd or whatever to mess with the cache. The cache returned is a keyring named "_krb.<uid>" that the possessor can read, search, clear, invalidate, unlink from and add links to. Active LSMs get a chance to rule on whether the caller is permitted to make a link. Each uid's cache keyring is created when it first accessed and is given a timeout that is extended each time this function is called so that the keyring goes away after a while. The timeout is configurable by sysctl but defaults to three days. Each user_namespace struct gets a lazily-created keyring that serves as the register. The cache keyrings are added to it. This means that standard key search and garbage collection facilities are available. The user_namespace struct's register goes away when it does and anything left in it is then automatically gc'd. Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Simo Sorce <simo@redhat.com> cc: Serge E. Hallyn <serge.hallyn@ubuntu.com> cc: Eric W. Biederman <ebiederm@xmission.com>
2013-09-18Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer fix from Ingo Molnar: "An NTP related lockup fix" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: timekeeping: Fix HRTICK related deadlock from ntp lock changes
2013-09-18Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "Misc fixes" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix comment for sched_info_depart sched/Documentation: Update sched-design-CFS.txt documentation sched/debug: Take PID namespace into account sched/fair: Fix small race where child->se.parent,cfs_rq might point to invalid ones
2013-09-16sched: Fix comment for sched_info_departMichael S. Tsirkin
sched_info_depart seems to be only called from sched_info_switch(), so only on involuntary task switch. Fix the comment to match. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Link: http://lkml.kernel.org/r/20130916083036.GA1113@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-13Merge git://git.kvack.org/~bcrl/aio-nextLinus Torvalds
Pull aio changes from Ben LaHaise: "First off, sorry for this pull request being late in the merge window. Al had raised a couple of concerns about 2 items in the series below. I addressed the first issue (the race introduced by Gu's use of mm_populate()), but he has not provided any further details on how he wants to rework the anon_inode.c changes (which were sent out months ago but have yet to be commented on). The bulk of the changes have been sitting in the -next tree for a few months, with all the issues raised being addressed" * git://git.kvack.org/~bcrl/aio-next: (22 commits) aio: rcu_read_lock protection for new rcu_dereference calls aio: fix race in ring buffer page lookup introduced by page migration support aio: fix rcu sparse warnings introduced by ioctx table lookup patch aio: remove unnecessary debugging from aio_free_ring() aio: table lookup: verify ctx pointer staging/lustre: kiocb->ki_left is removed aio: fix error handling and rcu usage in "convert the ioctx list to table lookup v3" aio: be defensive to ensure request batching is non-zero instead of BUG_ON() aio: convert the ioctx list to table lookup v3 aio: double aio_max_nr in calculations aio: Kill ki_dtor aio: Kill ki_users aio: Kill unneeded kiocb members aio: Kill aio_rw_vect_retry() aio: Don't use ctx->tail unnecessarily aio: io_cancel() no longer returns the io_event aio: percpu ioctx refcount aio: percpu reqs_available aio: reqs_active -> reqs_available aio: fix build when migration is disabled ...
2013-09-13Remove GENERIC_HARDIRQ config optionMartin Schwidefsky
After the last architecture switched to generic hard irqs the config options HAVE_GENERIC_HARDIRQS & GENERIC_HARDIRQS and the related code for !CONFIG_GENERIC_HARDIRQS can be removed. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2013-09-12Merge branch 'akpm' (patches from Andrew Morton)Linus Torvalds
Merge more patches from Andrew Morton: "The rest of MM. Plus one misc cleanup" * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (35 commits) mm/Kconfig: add MMU dependency for MIGRATION. kernel: replace strict_strto*() with kstrto*() mm, thp: count thp_fault_fallback anytime thp fault fails thp: consolidate code between handle_mm_fault() and do_huge_pmd_anonymous_page() thp: do_huge_pmd_anonymous_page() cleanup thp: move maybe_pmd_mkwrite() out of mk_huge_pmd() mm: cleanup add_to_page_cache_locked() thp: account anon transparent huge pages into NR_ANON_PAGES truncate: drop 'oldsize' truncate_pagecache() parameter mm: make lru_add_drain_all() selective memcg: document cgroup dirty/writeback memory statistics memcg: add per cgroup writeback pages accounting memcg: check for proper lock held in mem_cgroup_update_page_stat memcg: remove MEMCG_NR_FILE_MAPPED memcg: reduce function dereference memcg: avoid overflow caused by PAGE_ALIGN memcg: rename RESOURCE_MAX to RES_COUNTER_MAX memcg: correct RESOURCE_MAX to ULLONG_MAX mm: memcg: do not trap chargers with full callstack on OOM mm: memcg: rework and document OOM waiting and wakeup ...
2013-09-12kernel: replace strict_strto*() with kstrto*()Jingoo Han
The usage of strict_strto*() is not preferred, because strict_strto*() is obsolete. Thus, kstrto*() should be used. Signed-off-by: Jingoo Han <jg1.han@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: reduce function dereferenceSha Zhengju
This function dereferences res far too often, so optimize it. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: avoid overflow caused by PAGE_ALIGNSha Zhengju
Since PAGE_ALIGN is aligning up(the next page boundary), so after PAGE_ALIGN, the value might be overflow, such as write the MAX value to *.limit_in_bytes. $ cat /cgroup/memory/memory.limit_in_bytes 18446744073709551615 # echo 18446744073709551615 > /cgroup/memory/memory.limit_in_bytes bash: echo: write error: Invalid argument Some user programs might depend on such behaviours(like libcg, we read the value in snapshot, then use the value to reset cgroup later), and that will cause confusion. So we need to fix it. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12memcg: rename RESOURCE_MAX to RES_COUNTER_MAXSha Zhengju
RESOURCE_MAX is far too general name, change it to RES_COUNTER_MAX. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Signed-off-by: Qiang Huang <h.huangqiang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.cz> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Jeff Liu <jeff.liu@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 4 from Al Viro: "list_lru pile, mostly" This came out of Andrew's pile, Al ended up doing the merge work so that Andrew didn't have to. Additionally, a few fixes. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (42 commits) super: fix for destroy lrus list_lru: dynamically adjust node arrays shrinker: Kill old ->shrink API. shrinker: convert remaining shrinkers to count/scan API staging/lustre/libcfs: cleanup linux-mem.h staging/lustre/ptlrpc: convert to new shrinker API staging/lustre/obdclass: convert lu_object shrinker to count/scan API staging/lustre/ldlm: convert to shrinkers to count/scan API hugepage: convert huge zero page shrinker to new shrinker API i915: bail out earlier when shrinker cannot acquire mutex drivers: convert shrinkers to new count/scan API fs: convert fs shrinkers to new scan/count API xfs: fix dquot isolation hang xfs-convert-dquot-cache-lru-to-list_lru-fix xfs: convert dquot cache lru to list_lru xfs: rework buffer dispose list tracking xfs-convert-buftarg-lru-to-generic-code-fix xfs: convert buftarg LRU to generic code fs: convert inode and dentry shrinking to be node aware vmscan: per-node deferred work ...
2013-09-12Merge tag 'pm+acpi-fixes-3.12-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull ACPI and power management fixes from Rafael Wysocki: "All of these commits are fixes that have emerged recently and some of them fix bugs introduced during this merge window. Specifics: 1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events After the recent ACPIPHP changes we've seen some interesting breakage on a system that triggers device check notifications during boot for non-existing devices. Although those notifications are really spurious, we should be able to deal with them nevertheless and that shouldn't introduce too much overhead. Four commits to make that work properly. 2) Memory hotplug and hibernation mutual exclusion rework This was maent to be a cleanup, but it happens to fix a classical ABBA deadlock between system suspend/hibernation and ACPI memory hotplug which is possible if they are started roughly at the same time. Three commits rework memory hotplug so that it doesn't acquire pm_mutex and make hibernation use device_hotplug_lock which prevents it from racing with memory hotplug. 3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix The ACPI LPSS driver crashes during boot on Apple Macbook Air with Haswell that has slightly unusual BIOS configuration in which one of the LPSS device's _CRS method doesn't return all of the information expected by the driver. Fix from Mika Westerberg, for stable. 4) ACPICA fix related to Store->ArgX operation AML interpreter fix for obscure breakage that causes AML to be executed incorrectly on some machines (observed in practice). From Bob Moore. 5) ACPI core fix for PCI ACPI device objects lookup There still are cases in which there is more than one ACPI device object matching a given PCI device and we don't choose the one that the BIOS expects us to choose, so this makes the lookup take more criteria into account in those cases. 6) Fix to prevent cpuidle from crashing in some rare cases If the result of cpuidle_get_driver() is NULL, which can happen on some systems, cpuidle_driver_ref() will crash trying to use that pointer and the Daniel Fu's fix prevents that from happening. 7) cpufreq fixes related to CPU hotplug Stephen Boyd reported a number of concurrency problems with cpufreq related to CPU hotplug which are addressed by a series of fixes from Srivatsa S Bhat and Viresh Kumar. 8) cpufreq fix for time conversion in time_in_state attribute Time conversion carried out by cpufreq when user space attempts to read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state won't work correcty if cputime_t doesn't map directly to jiffies. Fix from Andreas Schwab. 9) Revert of a troublesome cpufreq commit Commit 7c30ed5 (cpufreq: make sure frequency transitions are serialized) was intended to address some known concurrency problems in cpufreq related to the ordering of transitions, but unfortunately it introduced several problems of its own, so I decided to revert it now and address the original problems later in a more robust way. 10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle. 11) cpufreq fixes related to system suspend/resume The recent cpufreq changes that made it preserve CPU sysfs attributes over suspend/resume cycles introduced a possible NULL pointer dereference that caused it to crash during the second attempt to suspend. Three commits from Srivatsa S Bhat fix that problem and a couple of related issues. 12) cpufreq locking fix cpufreq_policy_restore() should acquire the lock for reading, but it acquires it for writing. Fix from Lan Tianyu" * tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits) cpufreq: Acquire the lock in cpufreq_policy_restore() for reading cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu cpufreq: Restructure if/else block to avoid unintended behavior cpufreq: Fix crash in cpufreq-stats during suspend/resume intel_pstate: Add Haswell CPU models Revert "cpufreq: make sure frequency transitions are serialized" cpufreq: Use signed type for 'ret' variable, to store negative error values cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock cpufreq: Split __cpufreq_remove_dev() into two parts cpufreq: Fix wrong time unit conversion cpufreq: serialize calls to __cpufreq_governor() cpufreq: don't allow governor limits to be changed when it is disabled ACPI / bind: Prefer device objects with _STA to those without it ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks ACPI / hotplug / PCI: Use _OST to notify firmware about notify status ACPI / hotplug / PCI: Avoid doing too much for spurious notifies ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field. ACPI / hotplug / PCI: Don't trim devices before scanning the namespace ...
2013-09-12Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Various fixes. The -g perf report lockup you reported is only partially addressed, patches that fix the excessive runtime are still being worked on" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86: Fix uncore PCI fixed counter handling uprobes: Fix utask->depth accounting in handle_trampoline() perf/x86: Add constraint for IVB CYCLE_ACTIVITY:CYCLES_LDM_PENDING perf: Fix up MMAP2 buffer space reservation perf tools: Add attr->mmap2 support perf kvm: Fix sample_type manipulation perf evlist: Fix id pos in perf_evlist__open() perf trace: Handle perf.data files with no tracepoints perf session: Separate progress bar update when processing events perf trace: Check if MAP_32BIT is defined perf hists: Fix formatting of long symbol names perf evlist: Fix parsing with no sample_id_all bit set perf tools: Add test for parsing with no sample_id_all bit perf trace: Check control+C more often
2013-09-12Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Ingo Molnar: "Performance regression fix" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Fix load balancing performance regression in should_we_balance()
2013-09-12sched/debug: Take PID namespace into accountPeter Zijlstra
Emmanuel reported that /proc/sched_debug didn't report the right PIDs when using namespaces, cure this. Reported-by: Emmanuel Deloget <emmanuel.deloget@efixo.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130909110141.GM31370@twins.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12sched/fair: Fix small race where child->se.parent,cfs_rq might point to ↵Daisuke Nishimura
invalid ones There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jp Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12uprobes: Fix utask->depth accounting in handle_trampoline()Oleg Nesterov
Currently utask->depth is simply the number of allocated/pending return_instance's in uprobe_task->return_instances list. handle_trampoline() should decrement this counter every time we handle/free an instance, but due to typo it does this only if ->chained == T. This means that in the likely case this counter is never decremented and the probed task can't report more than MAX_URETPROBE_DEPTH events. Reported-by: Mikhail Kulemin <Mikhail.Kulemin@ru.ibm.com> Reported-by: Hemant Kumar Shaw <hkshaw@linux.vnet.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Anton Arapov <anton@redhat.com> Cc: masami.hiramatsu.pt@hitachi.com Cc: srikar@linux.vnet.ibm.com Cc: systemtap@sourceware.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20130911154726.GA8093@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-12timekeeping: Fix HRTICK related deadlock from ntp lock changesJohn Stultz
Gerlando Falauto reported that when HRTICK is enabled, it is possible to trigger system deadlocks. These were hard to reproduce, as HRTICK has been broken in the past, but seemed to be connected to the timekeeping_seq lock. Since seqlock/seqcount's aren't supported w/ lockdep, I added some extra spinlock based locking and triggered the following lockdep output: [ 15.849182] ntpd/4062 is trying to acquire lock: [ 15.849765] (&(&pool->lock)->rlock){..-...}, at: [<ffffffff810aa9b5>] __queue_work+0x145/0x480 [ 15.850051] [ 15.850051] but task is already holding lock: [ 15.850051] (timekeeper_lock){-.-.-.}, at: [<ffffffff810df6df>] do_adjtimex+0x7f/0x100 <snip> [ 15.850051] Chain exists of: &(&pool->lock)->rlock --> &p->pi_lock --> timekeeper_lock [ 15.850051] Possible unsafe locking scenario: [ 15.850051] [ 15.850051] CPU0 CPU1 [ 15.850051] ---- ---- [ 15.850051] lock(timekeeper_lock); [ 15.850051] lock(&p->pi_lock); [ 15.850051] lock(timekeeper_lock); [ 15.850051] lock(&(&pool->lock)->rlock); [ 15.850051] [ 15.850051] *** DEADLOCK *** The deadlock was introduced by 06c017fdd4dc48451a ("timekeeping: Hold timekeepering locks in do_adjtimex and hardpps") in 3.10 This patch avoids this deadlock, by moving the call to schedule_delayed_work() outside of the timekeeper lock critical section. Reported-by: Gerlando Falauto <gerlando.falauto@keymile.com> Tested-by: Lin Ming <minggr@gmail.com> Signed-off-by: John Stultz <john.stultz@linaro.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: stable <stable@vger.kernel.org> #3.11, 3.10 Link: http://lkml.kernel.org/r/1378943457-27314-1-git-send-email-john.stultz@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2013-09-11panic: call panic handlers before kmsg_dumpKees Cook
Since the panic handlers may produce additional information (via printk) for the kernel log, it should be reported as part of the panic output saved by kmsg_dump(). Without this re-ordering, nothing that adds information to a panic will show up in pstore's view when kmsg_dump runs, and is therefore not visible to crash reporting tools that examine pstore output. Signed-off-by: Kees Cook <keescook@chromium.org> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Acked-by: Tony Luck <tony.luck@intel.com> Cc: Stephen Boyd <sboyd@codeaurora.org> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kexec: remove unnecessary returnXishi Qiu
Code can not run here forever, so remove the unnecessary return. Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Suggested-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: Simon Horman <horms@verge.net.au> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11__ptrace_may_access() should not deny sub-threadsMark Grondona
__ptrace_may_access() checks get_dumpable/ptrace_has_cap/etc if task != current, this can can lead to surprising results. For example, a sub-thread can't readlink("/proc/self/exe") if the executable is not readable. setup_new_exec()->would_dump() notices that inode_permission(MAY_READ) fails and then it does set_dumpable(suid_dumpable). After that get_dumpable() fails. (It is not clear why proc_pid_readlink() checks get_dumpable(), perhaps we could add PTRACE_MODE_NODUMPABLE) Change __ptrace_may_access() to use same_thread_group() instead of "task == current". Any security check is pointless when the tasks share the same ->mm. Signed-off-by: Mark Grondona <mgrondona@llnl.gov> Signed-off-by: Ben Woodard <woodard@redhat.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kprobes: allow to specify custom allocator for insn cachesHeiko Carstens
The current two insn slot caches both use module_alloc/module_free to allocate and free insn slot cache pages. For s390 this is not sufficient since there is the need to allocate insn slots that are either within the vmalloc module area or within dma memory. Therefore add a mechanism which allows to specify an own allocator for an own insn slot cache. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kprobes: unify insn cachesHeiko Carstens
The current kpropes insn caches allocate memory areas for insn slots with module_alloc(). The assumption is that the kernel image and module area are both within the same +/- 2GB memory area. This however is not true for s390 where the kernel image resides within the first 2GB (DMA memory area), but the module area is far away in the vmalloc area, usually somewhere close below the 4TB area. For new pc relative instructions s390 needs insn slots that are within +/- 2GB of each area. That way we can patch displacements of pc-relative instructions within the insn slots just like x86 and powerpc. The module area works already with the normal insn slot allocator, however there is currently no way to get insn slots that are within the first 2GB on s390 (aka DMA area). Therefore this patch set modifies the kprobes insn slot cache code in order to allow to specify a custom allocator for the insn slot cache pages. In addition architecure can now have private insn slot caches withhout the need to modify common code. Patch 1 unifies and simplifies the current insn and optinsn caches implementation. This is a preparation which allows to add more insn caches in a simple way. Patch 2 adds the possibility to specify a custom allocator. Patch 3 makes s390 use the new insn slot mechanisms and adds support for pc-relative instructions with long displacements. This patch (of 3): The two insn caches (insn, and optinsn) each have an own mutex and alloc/free functions (get_[opt]insn_slot() / free_[opt]insn_slot()). Since there is the need for yet another insn cache which satifies dma allocations on s390, unify and simplify the current implementation: - Move the per insn cache mutex into struct kprobe_insn_cache. - Move the alloc/free functions to kprobe.h so they are simply wrappers for the generic __get_insn_slot/__free_insn_slot functions. The implementation is done with a DEFINE_INSN_CACHE_OPS() macro which provides the alloc/free functions for each cache if needed. - move the struct kprobe_insn_cache to kprobe.h which allows to generate architecture specific insn slot caches outside of the core kprobes code. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11task_work: documentationOleg Nesterov
No functional changes, just comments. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11task_work: minor cleanupsOleg Nesterov
Trivial. Remove the unnecessary "work = NULL" initialization and turn read_barrier_depends() into smp_read_barrier_depends() in task_work_cancel(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kernel/smp.c: quit unconditionally enabling irqs in on_each_cpu_mask().David Daney
As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in !SMP version of on_each_cpu()"), we don't want to enable irqs if they are not already enabled. I don't know of any bugs currently caused by this unconditional local_irq_enable(), but I want to use this function in MIPS/OCTEON early boot (when we have early_boot_irqs_disabled). This also makes this function have similar semantics to on_each_cpu() which is good in itself. Signed-off-by: David Daney <david.daney@cavium.com> Cc: Gilad Ben-Yossef <gilad@benyossef.com> Cc: Christoph Lameter <cl@linux.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11extable: skip sorting if the table is emptyUwe Kleine-König
At least on ARM no-MMU the extable is empty and so there is nothing to sort. So add a check for the table to be empty which effectively only changes that the misleading pr_notice is suppressed. Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: David Daney <david.daney@cavium.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11smp.h: move !SMP version of on_each_cpu() out-of-lineDavid Daney
All of the other non-trivial !SMP versions of functions in smp.h are out-of-line in up.c. Move on_each_cpu() there as well. This allows us to get rid of the #include <linux/irqflags.h>. The drawback is that this makes both the x86_64 and i386 defconfig !SMP kernels about 200 bytes larger each. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11up.c: use local_irq_{save,restore}() in smp_call_function_single.David Daney
The SMP version of this function doesn't unconditionally enable irqs, so neither should this !SMP version. There are no know problems caused by this, but we make the change for consistency's sake. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11smp: quit unconditionally enabling irq in on_each_cpu_mask and on_each_cpu_condDavid Daney
As in commit f21afc25f9ed ("smp.h: Use local_irq_{save,restore}() in !SMP version of on_each_cpu()"), we don't want to enable irqs if they are not already enabled. There are currently no known problematical callers of these functions, but since it is a known failure pattern, we preemptively fix them. Since they are not trivial functions, make them non-inline by moving them to up.c. This also makes it so we don't have to fix #include dependancies for preempt_{disable,enable}. Signed-off-by: David Daney <david.daney@cavium.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kernel/spinlock.c: add default arch_*_relax definitions for GENERIC_LOCKBREAKWill Deacon
When running with GENERIC_LOCKBREAK=y, the locking implementations emit calls to arch_{read,write,spin}_relax when spinning on a contended lock in order to allow architectures to favour the CPU owning the lock if possible. In reality, everybody apart from PowerPC and S390 just does cpu_relax() here, so make that the default behaviour and allow it to be overridden if required. Signed-off-by: Will Deacon <will.deacon@arm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kernel/smp.c: free related resources when failure occurs in hotplug_cfd()Chen Gang
When failure occurs in hotplug_cfd(), need release related resources, or will cause memory leak. Signed-off-by: Chen Gang <gang.chen@asianux.com> Acked-by: Wang YanQing <udknight@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kernel/modsign_pubkey.c: fix init const for module signing codeAndi Kleen
const has to use __initconst, not __initdata Signed-off-by: Andi Kleen <ak@linux.intel.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11kernel-wide: fix missing validations on __get/__put/__copy_to/__copy_from_user()Mathieu Desnoyers
I found the following pattern that leads in to interesting findings: grep -r "ret.*|=.*__put_user" * grep -r "ret.*|=.*__get_user" * grep -r "ret.*|=.*__copy" * The __put_user() calls in compat_ioctl.c, ptrace compat, signal compat, since those appear in compat code, we could probably expect the kernel addresses not to be reachable in the lower 32-bit range, so I think they might not be exploitable. For the "__get_user" cases, I don't think those are exploitable: the worse that can happen is that the kernel will copy kernel memory into in-kernel buffers, and will fail immediately afterward. The alpha csum_partial_copy_from_user() seems to be missing the access_ok() check entirely. The fix is inspired from x86. This could lead to information leak on alpha. I also noticed that many architectures map csum_partial_copy_from_user() to csum_partial_copy_generic(), but I wonder if the latter is performing the access checks on every architectures. Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: prepare to remove /proc/sys/vm/hugepages_treat_as_movableNaoya Horiguchi
Now hugepage migration is enabled, although restricted on pmd-based hugepages for now (due to lack of testing.) So we should allocate migratable hugepages from ZONE_MOVABLE if possible. This patch makes GFP flags in hugepage allocation dependent on migration support, not only the value of hugepages_treat_as_movable. It provides no change on the behavior for architectures which do not support hugepage migration, Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Andi Kleen <ak@linux.intel.com> Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: use zone_end_pfn() instead of zone_start_pfn+spanned_pagesXishi Qiu
Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages". Simplify the code, no functional change. [akpm@linux-foundation.org: fix build] Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Cody P Schafer <cody@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11mm: mempolicy: turn vma_set_policy() into vma_dup_policy()Oleg Nesterov
Simple cleanup. Every user of vma_set_policy() does the same work, this looks a bit annoying imho. And the new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11fork: unify and tighten up CLONE_NEWUSER/CLONE_NEWPID checksOleg Nesterov
do_fork() denies CLONE_THREAD | CLONE_PARENT if NEWUSER | NEWPID. Then later copy_process() denies CLONE_SIGHAND if the new process will be in a different pid namespace (task_active_pid_ns() doesn't match current->nsproxy->pid_ns). This looks confusing and inconsistent. CLONE_NEWPID is very similar to the case when ->pid_ns was already unshared, we want the same restrictions so copy_process() should also nack CLONE_PARENT. And it would be better to deny CLONE_NEWUSER && CLONE_SIGHAND as well just for consistency. Kill the "CLONE_NEWUSER | CLONE_NEWPID" check in do_fork() and change copy_process() to do the same check along with ->pid_ns check we already have. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11pidns: kill the unnecessary CLONE_NEWPID in copy_process()Oleg Nesterov
Commit 8382fcac1b81 ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_NEWPID if the forking process unshared pid_ns. This is correct but unnecessary, copy_pid_ns() does the same check. Remove the CLONE_NEWPID check to cleanup the code and prepare for the next change. Test-case: static int child(void *arg) { return 0; } static char stack[16 * 1024]; int main(void) { pid_t pid; assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); pid = clone(child, stack + sizeof(stack) / 2, CLONE_NEWPID | SIGCHLD, NULL); assert(pid < 0 && errno == EINVAL); return 0; } clone(CLONE_NEWPID) correctly fails with or without this change. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Colin Walters <walters@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11pidns: fix vfork() after unshare(CLONE_NEWPID)Oleg Nesterov
Commit 8382fcac1b81 ("pidns: Outlaw thread creation after unshare(CLONE_NEWPID)") nacks CLONE_VM if the forking process unshared pid_ns, this obviously breaks vfork: int main(void) { assert(unshare(CLONE_NEWUSER | CLONE_NEWPID) == 0); assert(vfork() >= 0); _exit(0); return 0; } fails without this patch. Change this check to use CLONE_SIGHAND instead. This also forbids CLONE_THREAD automatically, and this is what the comment implies. We could probably even drop CLONE_SIGHAND and use CLONE_THREAD, but it would be safer to not do this. The current check denies CLONE_SIGHAND implicitely and there is no reason to change this. Eric said "CLONE_SIGHAND is fine. CLONE_THREAD would be even better. Having shared signal handling between two different pid namespaces is the case that we are fundamentally guarding against." Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reported-by: Colin Walters <walters@redhat.com> Acked-by: Andy Lutomirski <luto@amacapital.net> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-09-11perf: Fix up MMAP2 buffer space reservationArnaldo Carvalho de Melo
The ino_generation field was added in the PERF_RECORD_MMAP2 record in the 13d7a24 cset but no space for it was allocated, corrupting the PERF_FORMAT_{TIME,CPU,TID,etc} area (sample_type/sample_id_all), fix it. Detected with one of the regression tests done by 'perf test': [root@sandy ~]# perf test -v 7 7: Validate PERF_RECORD_* events & perf_sample fields : --- start --- 61315294449606 0 PERF_RECORD_SAMPLE 61315294453161 0 PERF_RECORD_SAMPLE 61315294454441 0 PERF_RECORD_SAMPLE 61315294455709 0 PERF_RECORD_SAMPLE 61315295600899 0 PERF_RECORD_COMM: sleep:6500 27917287430500 342521613 PERF_RECORD_MMAP2 6500/6500: [0x400000(0x7000) @ 0 00:1d 311442 9016]: /usr/bin/sleep MMAP2 going backwards in time, prev=61315295600899, curr=27917287430500 MMAP2 with unexpected cpu, expected 0, got 342521613 MMAP2 with unexpected pid, expected 6500, got 1701606191 MMAP2 with unexpected tid, expected 6500, got 28773 27917287430500 342561333 PERF_RECORD_MMAP2 6500/6500: [0x3b7e000000(0x223000) @ 0 00:1d 309186 9016]: /usr/lib64/ld-2.16.so MMAP2 with unexpected cpu, expected 0, got 342561333 MMAP2 with unexpected pid, expected 6500, got 1932408369 MMAP2 with unexpected tid, expected 6500, got 111 27917287430500 342600095 PERF_RECORD_MMAP2 6500/6500: [0x7fffbd7dc000(0x1000) @ 0x7fffbd7dc000 00:00 0 0]: [vdso] MMAP2 with unexpected cpu, expected 0, got 342600095 MMAP2 with unexpected pid, expected 6500, got 1935963739 MMAP2 with unexpected tid, expected 6500, got 23919 27917287430500 342882834 PERF_RECORD_MMAP2 6500/6500: [0x3b7e400000(0x3b8000) @ 0 00:1d 309187 9016]: /usr/lib64/libc-2.16.so MMAP2 with unexpected cpu, expected 0, got 342882834 MMAP2 with unexpected pid, expected 6500, got 909192754 MMAP2 with unexpected tid, expected 6500, got 7303982 61316297195411 0 PERF_RECORD_EXIT(6500:6500):(6500:6500) ---- end ---- Validate PERF_RECORD_* events & perf_sample fields: FAILED! [root@sandy ~]# After this patch: [root@sandy ~]# perf test 7 7: Validate PERF_RECORD_* events & perf_sample fields : Ok [root@sandy ~]# Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Stephane Eranian <eranian@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: David Ahern <dsahern@gmail.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/n/tip-heeuv986b8ha7whqg4o3he7c@git.kernel.org Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-09-10fs: bump inode and dentry counters to longGlauber Costa
This series reworks our current object cache shrinking infrastructure in two main ways: * Noticing that a lot of users copy and paste their own version of LRU lists for objects, we put some effort in providing a generic version. It is modeled after the filesystem users: dentries, inodes, and xfs (for various tasks), but we expect that other users could benefit in the near future with little or no modification. Let us know if you have any issues. * The underlying list_lru being proposed automatically and transparently keeps the elements in per-node lists, and is able to manipulate the node lists individually. Given this infrastructure, we are able to modify the up-to-now hammer called shrink_slab to proceed with node-reclaim instead of always searching memory from all over like it has been doing. Per-node lru lists are also expected to lead to less contention in the lru locks on multi-node scans, since we are now no longer fighting for a global lock. The locks usually disappear from the profilers with this change. Although we have no official benchmarks for this version - be our guest to independently evaluate this - earlier versions of this series were performance tested (details at http://permalink.gmane.org/gmane.linux.kernel.mm/100537) yielding no visible performance regressions while yielding a better qualitative behavior in NUMA machines. With this infrastructure in place, we can use the list_lru entry point to provide memcg isolation and per-memcg targeted reclaim. Historically, those two pieces of work have been posted together. This version presents only the infrastructure work, deferring the memcg work for a later time, so we can focus on getting this part tested. You can see more about the history of such work at http://lwn.net/Articles/552769/ Dave Chinner (18): dcache: convert dentry_stat.nr_unused to per-cpu counters dentry: move to per-sb LRU locks dcache: remove dentries from LRU before putting on dispose list mm: new shrinker API shrinker: convert superblock shrinkers to new API list: add a new LRU list type inode: convert inode lru list to generic lru list code. dcache: convert to use new lru list infrastructure list_lru: per-node list infrastructure shrinker: add node awareness fs: convert inode and dentry shrinking to be node aware xfs: convert buftarg LRU to generic code xfs: rework buffer dispose list tracking xfs: convert dquot cache lru to list_lru fs: convert fs shrinkers to new scan/count API drivers: convert shrinkers to new count/scan API shrinker: convert remaining shrinkers to count/scan API shrinker: Kill old ->shrink API. Glauber Costa (7): fs: bump inode and dentry counters to long super: fix calculation of shrinkable objects for small numbers list_lru: per-node API vmscan: per-node deferred work i915: bail out earlier when shrinker cannot acquire mutex hugepage: convert huge zero page shrinker to new shrinker API list_lru: dynamically adjust node arrays This patch: There are situations in very large machines in which we can have a large quantity of dirty inodes, unused dentries, etc. This is particularly true when umounting a filesystem, where eventually since every live object will eventually be discarded. Dave Chinner reported a problem with this while experimenting with the shrinker revamp patchset. So we believe it is time for a change. This patch just moves int to longs. Machines where it matters should have a big long anyway. Signed-off-by: Glauber Costa <glommer@openvz.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Arve Hjønnevåg <arve@android.com> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Chuck Lever <chuck.lever@oracle.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Dave Chinner <dchinner@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: J. Bruce Fields <bfields@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kent Overstreet <koverstreet@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Steven Whitehouse <swhiteho@redhat.com> Cc: Thomas Hellstrom <thellstrom@vmware.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>