aboutsummaryrefslogtreecommitdiff
path: root/kernel/exit.c
AgeCommit message (Collapse)Author
2013-06-15move exit_task_namespaces() outside of exit_notify()Oleg Nesterov
exit_notify() does exit_task_namespaces() after forget_original_parent(). This was needed to ensure that ->nsproxy can't be cleared prematurely, an exiting child we are going to reparent can do do_notify_parent() and use the parent's (ours) pid_ns. However, after 32084504 "pidns: use task_active_pid_ns in do_notify_parent" ->nsproxy != NULL is no longer needed, we rely on task_active_pid_ns(). Move exit_task_namespaces() from exit_notify() to do_exit(), after exit_fs() and before exit_task_work(). This solves the problem reported by Andrey, free_ipc_ns()->shm_destroy() does fput() which needs task_work_add(). Note: this particular problem can be fixed if we change fput(), and that change makes sense anyway. But there is another reason to move the callsite. The original reason for exit_task_namespaces() from the middle of exit_notify() was subtle and it has already gone away, now this looks confusing. And this allows us do simplify exit_notify(), we can avoid unlock/lock(tasklist) and we can use ->exit_state instead of PF_EXITING in forget_original_parent(). Reported-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Andrey Vagin <avagin@openvz.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS updates from Al Viro, Misc cleanups all over the place, mainly wrt /proc interfaces (switch create_proc_entry to proc_create(), get rid of the deprecated create_proc_read_entry() in favor of using proc_create_data() and seq_file etc). 7kloc removed. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits) don't bother with deferred freeing of fdtables proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h proc: Make the PROC_I() and PDE() macros internal to procfs proc: Supply a function to remove a proc entry by PDE take cgroup_open() and cpuset_open() to fs/proc/base.c ppc: Clean up scanlog ppc: Clean up rtas_flash driver somewhat hostap: proc: Use remove_proc_subtree() drm: proc: Use remove_proc_subtree() drm: proc: Use minor->index to label things, not PDE->name drm: Constify drm_proc_list[] zoran: Don't print proc_dir_entry data in debug reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show() proc: Supply an accessor for getting the data from a PDE's parent airo: Use remove_proc_subtree() rtl8192u: Don't need to save device proc dir PDE rtl8187se: Use a dir under /proc/net/r8180/ proc: Add proc_mkdir_data() proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h} proc: Move PDE_NET() to fs/proc/proc_net.c ...
2013-05-01Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull compat cleanup from Al Viro: "Mostly about syscall wrappers this time; there will be another pile with patches in the same general area from various people, but I'd rather push those after both that and vfs.git pile are in." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: syscalls.h: slightly reduce the jungles of macros get rid of union semop in sys_semctl(2) arguments make do_mremap() static sparc: no need to sign-extend in sync_file_range() wrapper ppc compat wrappers for add_key(2) and request_key(2) are pointless x86: trim sys_ia32.h x86: sys32_kill and sys32_mprotect are pointless get rid of compat_sys_semctl() and friends in case of ARCH_WANT_OLD_COMPAT_IPC merge compat sys_ipc instances consolidate compat lookup_dcookie() convert vmsplice to COMPAT_SYSCALL_DEFINE switch getrusage() to COMPAT_SYSCALL_DEFINE switch epoll_pwait to COMPAT_SYSCALL_DEFINE convert sendfile{,64} to COMPAT_SYSCALL_DEFINE switch signalfd{,4}() to COMPAT_SYSCALL_DEFINE make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protect make HAVE_SYSCALL_WRAPPERS unconditional consolidate cond_syscall and SYSCALL_ALIAS declarations teach SYSCALL_DEFINE<n> how to deal with long long/unsigned long long get rid of duplicate logics in __SC_....[1-6] definitions
2013-04-09get rid of the last free_pipe_info() callersAl Viro
and rename __free_pipe_info() to free_pipe_info() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-03-31Revert "lockdep: check that no locks held at freeze time"Paul Walmsley
This reverts commit 6aa9707099c4b25700940eb3d016f16c4434360d. Commit 6aa9707099c4 ("lockdep: check that no locks held at freeze time") causes problems with NFS root filesystems. The failures were noticed on OMAP2 and 3 boards during kernel init: [ BUG: swapper/0/1 still has locks held! ] 3.9.0-rc3-00344-ga937536 #1 Not tainted ------------------------------------- 1 lock held by swapper/0/1: #0: (&type->s_umount_key#13/1){+.+.+.}, at: [<c011e84c>] sget+0x248/0x574 stack backtrace: rpc_wait_bit_killable __wait_on_bit out_of_line_wait_on_bit __rpc_execute rpc_run_task rpc_call_sync nfs_proc_get_root nfs_get_root nfs_fs_mount_common nfs_try_mount nfs_fs_mount mount_fs vfs_kern_mount do_mount sys_mount do_mount_root mount_root prepare_namespace kernel_init_freeable kernel_init Although the rootfs mounts, the system is unstable. Here's a transcript from a PM test: http://www.pwsan.com/omap/testlogs/test_v3.9-rc3/20130317194234/pm/37xxevm/37xxevm_log.txt Here's what the test log should look like: http://www.pwsan.com/omap/testlogs/test_v3.8/20130218214403/pm/37xxevm/37xxevm_log.txt Mailing list discussion is here: http://lkml.org/lkml/2013/3/4/221 Deal with this for v3.9 by reverting the problem commit, until folks can figure out the right long-term course of action. Signed-off-by: Paul Walmsley <paul@pwsan.com> Cc: Mandeep Singh Baines <msb@chromium.org> Cc: Jeff Layton <jlayton@redhat.com> Cc: Shawn Guo <shawn.guo@linaro.org> Cc: <maciej.rutecki@gmail.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Ben Chan <benchan@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-03-03make SYSCALL_DEFINE<n>-generated wrappers do asmlinkage_protectAl Viro
... and switch i386 to HAVE_SYSCALL_WRAPPERS, killing open-coded uses of asmlinkage_protect() in a bunch of syscalls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-27coredump: use a freezable_schedule for the coredump_finish waitMandeep Singh Baines
Prevents hung_task detector from panicing the machine. This is also needed to prevent this wait from blocking suspend. (It doesnt' currently block suspend but it would once the next patch in this series is applied.) [yongjun_wei@trendmicro.com.cn: kernel/exit.c: remove duplicated include] Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ben Chan <benchan@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27lockdep: check that no locks held at freeze timeMandeep Singh Baines
We shouldn't try_to_freeze if locks are held. Holding a lock can cause a deadlock if the lock is later acquired in the suspend or hibernate path (e.g. by dpm). Holding a lock can also cause a deadlock in the case of cgroup_freezer if a lock is held inside a frozen cgroup that is later acquired by a process outside that group. [akpm@linux-foundation.org: export debug_check_no_locks_held] Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Cc: Ben Chan <benchan@chromium.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-01-27cputime: Use accessors to read task cputime statsFrederic Weisbecker
This is in preparation for the full dynticks feature. While remotely reading the cputime of a task running in a full dynticks CPU, we'll need to do some extra-computation. This way we can account the time it spent tickless in userspace since its last cputime snapshot. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Li Zhong <zhong@linux.vnet.ibm.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2012-12-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace changes from Eric Biederman: "While small this set of changes is very significant with respect to containers in general and user namespaces in particular. The user space interface is now complete. This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyranny of supporting suid root preventing unprivileged users from using cool new kernel features is broken. This set of changes completes the work on setns, adding support for the pid, user, mount namespaces. This set of changes includes a bunch of basic pid namespace cleanups/simplifications. Of particular significance is the rework of the pid namespace cleanup so it no longer requires sending out tendrils into all kinds of unexpected cleanup paths for operation. At least one case of broken error handling is fixed by this cleanup. The files under /proc/<pid>/ns/ have been converted from regular files to magic symlinks which prevents incorrect caching by the VFS, ensuring the files always refer to the namespace the process is currently using and ensuring that the ptrace_mayaccess permission checks are always applied. The files under /proc/<pid>/ns/ have been given stable inode numbers so it is now possible to see if different processes share the same namespaces. Through the David Miller's net tree are changes to relax many of the permission checks in the networking stack to allowing the user namespace root to usefully use the networking stack. Similar changes for the mount namespace and the pid namespace are coming through my tree. Two small changes to add user namespace support were commited here adn in David Miller's -net tree so that I could complete the work on the /proc/<pid>/ns/ files in this tree. Work remains to make it safe to build user namespaces and 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig guard remains in place preventing that user namespaces from being built when any of those filesystems are enabled. Future design work remains to allow root users outside of the initial user namespace to mount more than just /proc and /sys." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (38 commits) proc: Usable inode numbers for the namespace file descriptors. proc: Fix the namespace inode permission checks. proc: Generalize proc inode allocation userns: Allow unprivilged mounts of proc and sysfs userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file procfs: Print task uids and gids in the userns that opened the proc file userns: Implement unshare of the user namespace userns: Implent proc namespace operations userns: Kill task_user_ns userns: Make create_new_namespaces take a user_ns parameter userns: Allow unprivileged use of setns. userns: Allow unprivileged users to create new namespaces userns: Allow setting a userns mapping to your current uid. userns: Allow chown and setgid preservation userns: Allow unprivileged users to create user namespaces. userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped userns: fix return value on mntns_install() failure vfs: Allow unprivileged manipulation of the mount namespace. vfs: Only support slave subtrees across different user namespaces vfs: Add a user namespace reference from struct mnt_namespace ...
2012-12-12Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull big execve/kernel_thread/fork unification series from Al Viro: "All architectures are converted to new model. Quite a bit of that stuff is actually shared with architecture trees; in such cases it's literally shared branch pulled by both, not a cherry-pick. A lot of ugliness and black magic is gone (-3KLoC total in this one): - kernel_thread()/kernel_execve()/sys_execve() redesign. We don't do syscalls from kernel anymore for either kernel_thread() or kernel_execve(): kernel_thread() is essentially clone(2) with callback run before we return to userland, the callbacks either never return or do successful do_execve() before returning. kernel_execve() is a wrapper for do_execve() - it doesn't need to do transition to user mode anymore. As a result kernel_thread() and kernel_execve() are arch-independent now - they live in kernel/fork.c and fs/exec.c resp. sys_execve() is also in fs/exec.c and it's completely architecture-independent. - daemonize() is gone, along with its parts in fs/*.c - struct pt_regs * is no longer passed to do_fork/copy_process/ copy_thread/do_execve/search_binary_handler/->load_binary/do_coredump. - sys_fork()/sys_vfork()/sys_clone() unified; some architectures still need wrappers (ones with callee-saved registers not saved in pt_regs on syscall entry), but the main part of those suckers is in kernel/fork.c now." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (113 commits) do_coredump(): get rid of pt_regs argument print_fatal_signal(): get rid of pt_regs argument ptrace_signal(): get rid of unused arguments get rid of ptrace_signal_deliver() arguments new helper: signal_pt_regs() unify default ptrace_signal_deliver flagday: kill pt_regs argument of do_fork() death to idle_regs() don't pass regs to copy_process() flagday: don't pass regs to copy_thread() bfin: switch to generic vfork, get rid of pointless wrappers xtensa: switch to generic clone() openrisc: switch to use of generic fork and clone unicore32: switch to generic clone(2) score: switch to generic fork/vfork/clone c6x: sanitize copy_thread(), get rid of clone(2) wrapper, switch to generic clone() take sys_fork/sys_vfork/sys_clone prototypes to linux/syscalls.h mn10300: switch to generic fork/vfork/clone h8300: switch to generic fork/vfork/clone tile: switch to generic clone() ... Conflicts: arch/microblaze/include/asm/Kbuild
2012-11-28kill daemonize()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-11-28cputime: Rename thread_group_times to thread_group_cputime_adjustedFrederic Weisbecker
We have thread_group_cputime() and thread_group_times(). The naming doesn't provide enough information about the difference between these two APIs. To lower the confusion, rename thread_group_times() to thread_group_cputime_adjusted(). This name better suggests that it's a version of thread_group_cputime() that does some stabilization on the raw cputime values. ie here: scale on top of CFS runtime stats and bound lower value for monotonicity. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-11-19pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1Eric W. Biederman
Looking at pid_ns->nr_hashed is a bit simpler and it works for disjoint process trees that an unshare or a join of a pid_namespace may create. Acked-by: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-10-02Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs update from Al Viro: - big one - consolidation of descriptor-related logics; almost all of that is moved to fs/file.c (BTW, I'm seriously tempted to rename the result to fd.c. As it is, we have a situation when file_table.c is about handling of struct file and file.c is about handling of descriptor tables; the reasons are historical - file_table.c used to be about a static array of struct file we used to have way back). A lot of stray ends got cleaned up and converted to saner primitives, disgusting mess in android/binder.c is still disgusting, but at least doesn't poke so much in descriptor table guts anymore. A bunch of relatively minor races got fixed in process, plus an ext4 struct file leak. - related thing - fget_light() partially unuglified; see fdget() in there (and yes, it generates the code as good as we used to have). - also related - bits of Cyrill's procfs stuff that got entangled into that work; _not_ all of it, just the initial move to fs/proc/fd.c and switch of fdinfo to seq_file. - Alex's fs/coredump.c spiltoff - the same story, had been easier to take that commit than mess with conflicts. The rest is a separate pile, this was just a mechanical code movement. - a few misc patches all over the place. Not all for this cycle, there'll be more (and quite a few currently sit in akpm's tree)." Fix up trivial conflicts in the android binder driver, and some fairly simple conflicts due to two different changes to the sock_alloc_file() interface ("take descriptor handling from sock_alloc_file() to callers" vs "net: Providing protocol type via system.sockprotoname xattr of /proc/PID/fd entries" adding a dentry name to the socket) * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (72 commits) MAX_LFS_FILESIZE should be a loff_t compat: fs: Generic compat_sys_sendfile implementation fs: push rcu_barrier() from deactivate_locked_super() to filesystems btrfs: reada_extent doesn't need kref for refcount coredump: move core dump functionality into its own file coredump: prevent double-free on an error path in core dumper usb/gadget: fix misannotations fcntl: fix misannotations ceph: don't abuse d_delete() on failure exits hypfs: ->d_parent is never NULL or negative vfs: delete surplus inode NULL check switch simple cases of fget_light to fdget new helpers: fdget()/fdput() switch o2hb_region_dev_write() to fget_light() proc_map_files_readdir(): don't bother with grabbing files make get_file() return its argument vhost_set_vring(): turn pollstart/pollstop into bool switch prctl_set_mm_exe_file() to fget_light() switch xfs_find_handle() to fget_light() switch xfs_swapext() to fget_light() ...
2012-09-26new helper: daemonize_descriptors()Al Viro
descriptor-related parts of daemonize, done right. As the result we simplify the locking rules for ->files - we hold task_lock in *all* cases when we modify ->files. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-26move files_struct-related bits from kernel/exit.c to fs/file.cAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-09-24net: use a per task frag allocatorEric Dumazet
We currently use a per socket order-0 page cache for tcp_sendmsg() operations. This page is used to build fragments for skbs. Its done to increase probability of coalescing small write() into single segments in skbs still in write queue (not yet sent) But it wastes a lot of memory for applications handling many mostly idle sockets, since each socket holds one page in sk->sk_sndmsg_page Its also quite inefficient to build TSO 64KB packets, because we need about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit page allocator more than wanted. This patch adds a per task frag allocator and uses bigger pages, if available. An automatic fallback is done in case of memory pressure. (up to 32768 bytes per frag, thats order-3 pages on x86) This increases TCP stream performance by 20% on loopback device, but also benefits on other network devices, since 8x less frags are mapped on transmit and unmapped on tx completion. Alexander Duyck mentioned a probable performance win on systems with IOMMU enabled. Its possible some SG enabled hardware cant cope with bigger fragments, but their ndo_start_xmit() should already handle this, splitting a fragment in sub fragments, since some arches have PAGE_SIZE=65536 Successfully tested on various ethernet devices. (ixgbe, igb, bnx2x, tg3, mellanox mlx4) Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Vijay Subramanian <subramanian.vijay@gmail.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Tested-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-26posix_types.h: Cleanup stale __NFDBITS and related definitionsJosh Boyer
Recently, glibc made a change to suppress sign-conversion warnings in FD_SET (glibc commit ceb9e56b3d1). This uncovered an issue with the kernel's definition of __NFDBITS if applications #include <linux/types.h> after including <sys/select.h>. A build failure would be seen when passing the -Werror=sign-compare and -D_FORTIFY_SOURCE=2 flags to gcc. It was suggested that the kernel should either match the glibc definition of __NFDBITS or remove that entirely. The current in-kernel uses of __NFDBITS can be replaced with BITS_PER_LONG, and there are no uses of the related __FDELT and __FDMASK defines. Given that, we'll continue the cleanup that was started with commit 8b3d1cda4f5f ("posix_types: Remove fd_set macros") and drop the remaining unused macros. Additionally, linux/time.h has similar macros defined that expand to nothing so we'll remove those at the same time. Reported-by: Jeff Law <law@redhat.com> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> CC: <stable@vger.kernel.org> Signed-off-by: Josh Boyer <jwboyer@redhat.com> [ .. and fix up whitespace as per akpm ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-07-22move exit_task_work() past exit_files() et.al.Al Viro
... and get rid of PF_EXITING check in task_work_add(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-20pidns: find_new_reaper() can no longer switch to init_pid_ns.child_reaperOleg Nesterov
find_new_reaper() changes pid_ns->child_reaper, see add0d4df ("pid_ns: zap_pid_ns_processes: fix the ->child_reaper changing"). The original reason has gone away after the previous patch, ->children list must be empty after zap_pid_ns_processes(). However now we can not switch to init_pid_ns.child_reaper. __unhash_process() relies on the "->child_reaper == parent" check, but this check does not work if the last exiting task is also the child reaper. As Eric sugested, we can change __unhash_process() to use the parent's pid_ns and remove this code. Also, with this change we can move detach_pid(PIDTYPE_PID) back, where it was before the previous fix. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Louis Rilling <louis.rilling@kerlabs.com> Cc: Mike Galbraith <efault@gmx.de> Acked-by: Pavel Emelyanov <xemul@parallels.com> Tested-by: Andrew Wagin <avagin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20pidns: guarantee that the pidns init will be the last pidns process reapedEric W. Biederman
Today we have a twofold bug. Sometimes release_task on pid == 1 in a pid namespace can run before other processes in a pid namespace have had release task called. With the result that pid_ns_release_proc can be called before the last proc_flus_task() is done using upid->ns->proc_mnt, resulting in the use of a stale pointer. This same set of circumstances can lead to waitpid(...) returning for a processes started with clone(CLONE_NEWPID) before the every process in the pid namespace has actually exited. To fix this modify zap_pid_ns_processess wait until all other processes in the pid namespace have exited, even EXIT_DEAD zombies. The delay_group_leader and related tests ensure that the thread gruop leader will be the last thread of a process group to be reaped, or to become EXIT_DEAD and self reap. With the change to zap_pid_ns_processes we get the guarantee that pid == 1 in a pid namespace will be the last task that release_task is called on. With pid == 1 being the last task to pass through release_task pid_ns_release_proc can no longer be called too early nor can wait return before all of the EXIT_DEAD tasks in a pid namespace have exited. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Louis Rilling <louis.rilling@kerlabs.com> Cc: Mike Galbraith <efault@gmx.de> Acked-by: Pavel Emelyanov <xemul@parallels.com> Tested-by: Andrew Wagin <avagin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-20mm: correctly synchronize rss-counters at exit/execKonstantin Khlebnikov
do_exit() and exec_mmap() call sync_mm_rss() before mm_release() does put_user(clear_child_tid) which can update task->rss_stat and thus make mm->rss_stat inconsistent. This triggers the "BUG:" printk in check_mm(). Let's fix this bug in the safest way, and optimize/cleanup this later. Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07Revert "mm: correctly synchronize rss-counters at exit/exec"Linus Torvalds
This reverts commit 40af1bbdca47e5c8a2044039bb78ca8fd8b20f94. It's horribly and utterly broken for at least the following reasons: - calling sync_mm_rss() from mmput() is fundamentally wrong, because there's absolutely no reason to believe that the task that does the mmput() always does it on its own VM. Example: fork, ptrace, /proc - you name it. - calling it *after* having done mmdrop() on it is doubly insane, since the mm struct may well be gone now. - testing mm against NULL before you call it is insane too, since a NULL mm there would have caused oopses long before. .. and those are just the three bugs I found before I decided to give up looking for me and revert it asap. I should have caught it before I even took it, but I trusted Andrew too much. Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-07mm: correctly synchronize rss-counters at exit/execKonstantin Khlebnikov
mm->rss_stat counters have per-task delta: task->rss_stat. Before changing task->mm pointer the kernel must flush this delta with sync_mm_rss(). do_exit() already calls sync_mm_rss() to flush the rss-counters before committing the rss statistics into task->signal->maxrss, taskstats, audit and other stuff. Unfortunately the kernel does this before calling mm_release(), which can call put_user() for processing task->clear_child_tid. So at this point we can trigger page-faults and task->rss_stat becomes non-zero again. As a result mm->rss_stat becomes inconsistent and check_mm() will print something like this: | BUG: Bad rss-counter state mm:ffff88020813c380 idx:1 val:-1 | BUG: Bad rss-counter state mm:ffff88020813c380 idx:2 val:1 This patch moves sync_mm_rss() into mm_release(), and moves mm_release() out of do_exit() and calls it earlier. After mm_release() there should be no pagefaults. [akpm@linux-foundation.org: tweak comment] Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Cc: Hugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: <stable@vger.kernel.org> [3.4.x] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal Pull second pile of signal handling patches from Al Viro: "This one is just task_work_add() series + remaining prereqs for it. There probably will be another pull request from that tree this cycle - at least for helpers, to get them out of the way for per-arch fixes remaining in the tree." Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile had brought in commit 97fd75b7b8e0 ("kernel/irq/manage.c: use the pr_foo() infrastructure to prefix printks") which changed one of the pr_err() calls that this merge moves around. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: keys: kill task_struct->replacement_session_keyring keys: kill the dummy key_replace_session_keyring() keys: change keyctl_session_to_parent() to use task_work_add() genirq: reimplement exit_irq_thread() hook via task_work_add() task_work_add: generic process-context callbacks avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers parisc: need to check NOTIFY_RESUME when exiting from syscall move key_repace_session_keyring() into tracehook_notify_resume() TIF_NOTIFY_RESUME is defined on all targets now
2012-05-31stack usage: add pid to warning printk in check_stack_usageTim Bird
In embedded systems, sometimes the same program (busybox) is the cause of multiple warnings. Outputting the pid with the program name in the warning printk helps distinguish which instances of a program are using the stack most. This is a small patch, but useful. Signed-off-by: Tim Bird <tim.bird@am.sony.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-31cred: remove task_is_dead() from __task_cred() validationOleg Nesterov
Commit 8f92054e7ca1 ("CRED: Fix __task_cred()'s lockdep check and banner comment"): add the following validation condition: task->exit_state >= 0 to permit the access if the target task is dead and therefore unable to change its own credentials. OK, but afaics currently this can only help wait_task_zombie() which calls __task_cred() without rcu lock. Remove this validation and change wait_task_zombie() to use task_uid() instead. This means we do rcu_read_lock() only to shut up the lockdep, but we already do the same in, say, wait_task_stopped(). task_is_dead() should die, task->exit_state != 0 means that this task has passed exit_notify(), only do_wait-like code paths should use this. Unfortunately, we can't kill task_is_dead() right now, it has already acquired buggy users in drivers/staging. The fix already exists. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-23genirq: reimplement exit_irq_thread() hook via task_work_add()Oleg Nesterov
exit_irq_thread() and task->irq_thread are needed to handle the unexpected (and unlikely) exit of irq-thread. We can use task_work instead and make this all private to kernel/irq/manage.c, cleanup plus micro-optimization. 1. rename exit_irq_thread() to irq_thread_dtor(), make it static, and move it up before irq_thread(). 2. change irq_thread() to do task_work_add(irq_thread_dtor) at the start and task_work_cancel() before return. tracehook_notify_resume() can never play with kthreads, only do_exit()->exit_task_work() can call the callback and this is what we want. 3. remove task_struct->irq_thread and the special hook in do_exit(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Cc: David Howells <dhowells@redhat.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Smith <dsmith@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-23task_work_add: generic process-context callbacksOleg Nesterov
Provide a simple mechanism that allows running code in the (nonatomic) context of the arbitrary task. The caller does task_work_add(task, task_work) and this task executes task_work->func() either from do_notify_resume() or from do_exit(). The callback can rely on PF_EXITING to detect the latter case. "struct task_work" can be embedded in another struct, still it has "void *data" to handle the most common/simple case. This allows us to kill the ->replacement_session_keyring hack, and potentially this can have more users. Performance-wise, this adds 2 "unlikely(!hlist_empty())" checks into tracehook_notify_resume() and do_exit(). But at the same time we can remove the "replacement_session_keyring != NULL" checks from arch/*/signal.c and exit_creds(). Note: task_work_add/task_work_run abuses ->pi_lock. This is only because this lock is already used by lookup_pi_state() to synchronize with do_exit() setting PF_EXITING. Fortunately the scope of this lock in task_work.c is really tiny, and the code is unlikely anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: David Howells <dhowells@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Alexander Gordeev <agordeev@redhat.com> Cc: Chris Zankel <chris@zankel.net> Cc: David Smith <dsmith@redhat.com> Cc: "Frank Ch. Eigler" <fche@redhat.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-05-17cred: use correct cred accessor with regards to rcu read lockSasha Levin
Commit "userns: Convert setting and getting uid and gid system calls to use kuid and kgid has modified the accessors in wait_task_continued() and wait_task_stopped() to use __task_cred() instead of task_uid(). __task_cred() assumes that we're inside a rcu read lock, which is untrue for these two functions. Modify it to use task_uid() instead. Signed-off-by: Sasha Levin <levinsasha928@gmail.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-03userns: Convert setting and getting uid and gid system calls to use kuid and ↵Eric W. Biederman
kgid Convert setregid, setgid, setreuid, setuid, setresuid, getresuid, setresgid, getresgid, setfsuid, setfsgid, getuid, geteuid, getgid, getegid, waitpid, waitid, wait4. Convert userspace uids and gids into kuids and kgids before being placed on struct cred. Convert struct cred kuids and kgids into userspace uids and gids when returning them. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-03-29Merge branch 'x86-x32-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x32 support for x86-64 from Ingo Molnar: "This tree introduces the X32 binary format and execution mode for x86: 32-bit data space binaries using 64-bit instructions and 64-bit kernel syscalls. This allows applications whose working set fits into a 32 bits address space to make use of 64-bit instructions while using a 32-bit address space with shorter pointers, more compressed data structures, etc." Fix up trivial context conflicts in arch/x86/{Kconfig,vdso/vma.c} * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits) x32: Fix alignment fail in struct compat_siginfo x32: Fix stupid ia32/x32 inversion in the siginfo format x32: Add ptrace for x32 x32: Switch to a 64-bit clock_t x32: Provide separate is_ia32_task() and is_x32_task() predicates x86, mtrr: Use explicit sizing and padding for the 64-bit ioctls x86/x32: Fix the binutils auto-detect x32: Warn and disable rather than error if binutils too old x32: Only clear TIF_X32 flag once x32: Make sure TS_COMPAT is cleared for x32 tasks fs: Remove missed ->fds_bits from cessation use of fd_set structs internally fs: Fix close_on_exec pointer in alloc_fdtable x32: Drop non-__vdso weak symbols from the x32 VDSO x32: Fix coding style violations in the x32 VDSO code x32: Add x32 VDSO support x32: Allow x32 to be configured x32: If configured, add x32 system calls to system call tables x32: Handle process creation x32: Signal-related system calls x86: Add #ifdef CONFIG_COMPAT to <asm/sys_ia32.h> ...
2012-03-23kernel/exit.c: if init dies, log a signal which killed it, if anyDenys Vlasenko
I just received another user's pleas for help when their init mysteriously died. I again explained that they need to check whether it died because of bad instruction, a segv, or something else. Which was an annoying detour into writing a trivial C program to spawn his init and print its exit code: http://lists.busybox.net/pipermail/busybox/2012-January/077172.html I hear you saying "just test it under /bin/sh". Well, the crashing init _was_ /bin/sh. Which prompted me to make kernel do this first step automatically. We can print exit code, which makes it possible to see that death was from e.g. SIGILL without writing test programs. [akpm@linux-foundation.org: add 0x to hex number output] Signed-off-by: Denys Vlasenko <vda.linux@googlemail.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-23prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervisionLennart Poettering
Userspace service managers/supervisors need to track their started services. Many services daemonize by double-forking and get implicitly re-parented to PID 1. The service manager will no longer be able to receive the SIGCHLD signals for them, and is no longer in charge of reaping the children with wait(). All information about the children is lost at the moment PID 1 cleans up the re-parented processes. With this prctl, a service manager process can mark itself as a sort of 'sub-init', able to stay as the parent for all orphaned processes created by the started services. All SIGCHLD signals will be delivered to the service manager. Receiving SIGCHLD and doing wait() is in cases of a service-manager much preferred over any possible asynchronous notification about specific PIDs, because the service manager has full access to the child process data in /proc and the PID can not be re-used until the wait(), the service-manager itself is in charge of, has happened. As a side effect, the relevant parent PID information does not get lost by a double-fork, which results in a more elaborate process tree and 'ps' output: before: # ps afx 253 ? Ss 0:00 /bin/dbus-daemon --system --nofork 294 ? Sl 0:00 /usr/libexec/polkit-1/polkitd 328 ? S 0:00 /usr/sbin/modem-manager 608 ? Sl 0:00 /usr/libexec/colord 658 ? Sl 0:00 /usr/libexec/upowerd 819 ? Sl 0:00 /usr/libexec/imsettings-daemon 916 ? Sl 0:00 /usr/libexec/udisks-daemon 917 ? S 0:00 \_ udisks-daemon: not polling any devices after: # ps afx 294 ? Ss 0:00 /bin/dbus-daemon --system --nofork 426 ? Sl 0:00 \_ /usr/libexec/polkit-1/polkitd 449 ? S 0:00 \_ /usr/sbin/modem-manager 635 ? Sl 0:00 \_ /usr/libexec/colord 705 ? Sl 0:00 \_ /usr/libexec/upowerd 959 ? Sl 0:00 \_ /usr/libexec/udisks-daemon 960 ? S 0:00 | \_ udisks-daemon: not polling any devices 977 ? Sl 0:00 \_ /usr/libexec/packagekitd This prctl is orthogonal to PID namespaces. PID namespaces are isolated from each other, while a service management process usually requires the services to live in the same namespace, to be able to talk to each other. Users of this will be the systemd per-user instance, which provides init-like functionality for the user's login session and D-Bus, which activates bus services on-demand. Both need init-like capabilities to be able to properly keep track of the services they start. Many thanks to Oleg for several rounds of review and insights. [akpm@linux-foundation.org: fix comment layout and spelling] [akpm@linux-foundation.org: add lengthy code comment from Oleg] Reviewed-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Lennart Poettering <lennart@poettering.net> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Acked-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-22Merge branch 'akpm' (Andrew's patch-bomb)Linus Torvalds
Merge first batch of patches from Andrew Morton: "A few misc things and all the MM queue" * emailed from Andrew Morton <akpm@linux-foundation.org>: (92 commits) memcg: avoid THP split in task migration thp: add HPAGE_PMD_* definitions for !CONFIG_TRANSPARENT_HUGEPAGE memcg: clean up existing move charge code mm/memcontrol.c: remove unnecessary 'break' in mem_cgroup_read() mm/memcontrol.c: remove redundant BUG_ON() in mem_cgroup_usage_unregister_event() mm/memcontrol.c: s/stealed/stolen/ memcg: fix performance of mem_cgroup_begin_update_page_stat() memcg: remove PCG_FILE_MAPPED memcg: use new logic for page stat accounting memcg: remove PCG_MOVE_LOCK flag from page_cgroup memcg: simplify move_account() check memcg: remove EXPORT_SYMBOL(mem_cgroup_update_page_stat) memcg: kill dead prev_priority stubs memcg: remove PCG_CACHE page_cgroup flag memcg: let css_get_next() rely upon rcu_read_lock() cgroup: revert ss_id_lock to spinlock idr: make idr_get_next() good for rcu_read_lock() memcg: remove unnecessary thp check in page stat accounting memcg: remove redundant returns memcg: enum lru_list lru ...
2012-03-21mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat()David Rientjes
sync_mm_rss() can only be used for current to avoid race conditions in iterating and clearing its per-task counters. Remove the task argument for it and its helper function, __sync_task_rss_stat(), to avoid thinking it can be used safely for anything other than current. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-21Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security Pull security subsystem updates for 3.4 from James Morris: "The main addition here is the new Yama security module from Kees Cook, which was discussed at the Linux Security Summit last year. Its purpose is to collect miscellaneous DAC security enhancements in one place. This also marks a departure in policy for LSM modules, which were previously limited to being standalone access control systems. Chromium OS is using Yama, and I believe there are plans for Ubuntu, at least. This patchset also includes maintenance updates for AppArmor, TOMOYO and others." Fix trivial conflict in <net/sock.h> due to the jumo_label->static_key rename. * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (38 commits) AppArmor: Fix location of const qualifier on generated string tables TOMOYO: Return error if fails to delete a domain AppArmor: add const qualifiers to string arrays AppArmor: Add ability to load extended policy TOMOYO: Return appropriate value to poll(). AppArmor: Move path failure information into aa_get_name and rename AppArmor: Update dfa matching routines. AppArmor: Minor cleanup of d_namespace_path to consolidate error handling AppArmor: Retrieve the dentry_path for error reporting when path lookup fails AppArmor: Add const qualifiers to generated string tables AppArmor: Fix oops in policy unpack auditing AppArmor: Fix error returned when a path lookup is disconnected KEYS: testing wrong bit for KEY_FLAG_REVOKED TOMOYO: Fix mount flags checking order. security: fix ima kconfig warning AppArmor: Fix the error case for chroot relative path name lookup AppArmor: fix mapping of META_READ to audit and quiet flags AppArmor: Fix underflow in xindex calculation AppArmor: Fix dropping of allowed operations that are force audited AppArmor: Add mising end of structure test to caps unpacking ...
2012-03-21Merge tag 'pm-for-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates for 3.4 from Rafael Wysocki: "Assorted extensions and fixes including: * Introduction of early/late suspend/hibernation device callbacks. * Generic PM domains extensions and fixes. * devfreq updates from Axel Lin and MyungJoo Ham. * Device PM QoS updates. * Fixes of concurrency problems with wakeup sources. * System suspend and hibernation fixes." * tag 'pm-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (43 commits) PM / Domains: Check domain status during hibernation restore of devices PM / devfreq: add relation of recommended frequency. PM / shmobile: Make MTU2 driver use pm_genpd_dev_always_on() PM / shmobile: Make CMT driver use pm_genpd_dev_always_on() PM / shmobile: Make TMU driver use pm_genpd_dev_always_on() PM / Domains: Introduce "always on" device flag PM / Domains: Fix hibernation restore of devices, v2 PM / Domains: Fix handling of wakeup devices during system resume sh_mmcif / PM: Use PM QoS latency constraint tmio_mmc / PM: Use PM QoS latency constraint PM / QoS: Make it possible to expose PM QoS latency constraints PM / Sleep: JBD and JBD2 missing set_freezable() PM / Domains: Fix include for PM_GENERIC_DOMAINS=n case PM / Freezer: Remove references to TIF_FREEZE in comments PM / Sleep: Add more wakeup source initialization routines PM / Hibernate: Enable usermodehelpers in hibernate() error path PM / Sleep: Make __pm_stay_awake() delete wakeup source timers PM / Sleep: Fix race conditions related to wakeup source timer function PM / Sleep: Fix possible infinite loop during wakeup source destruction PM / Hibernate: print physical addresses consistently with other parts of kernel ...
2012-03-20exit_signal: fix the "parent has changed security domain" logicOleg Nesterov
exit_notify() changes ->exit_signal if the parent already did exec. This doesn't really work, we are not going to send the signal now if there is another live thread or the exiting task is traced. The parent can exec before the last dies or the tracer detaches. Move this check into do_notify_parent() which actually sends the signal. The user-visible change is that we do not change ->exit_signal, and thus the exiting task is still "clone children" for do_wait()->eligible_child(__WCLONE). Hopefully this is fine, the current logic is racy anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-20exit_signal: simplify the "we have changed execution domain" logicOleg Nesterov
exit_notify() checks "tsk->self_exec_id != tsk->parent_exec_id" to handle the "we have changed execution domain" case. We can change do_thread() to always set ->exit_signal = SIGCHLD and remove this check to simplify the code. We could change setup_new_exec() instead, this looks more logical because it increments ->self_exec_id. But note that de_thread() already resets ->exit_signal if it changes the leader, let's keep both changes close to each other. Note that we change ->exit_signal lockless, this changes the rules. Thereafter ->exit_signal is not stable under tasklist but this is fine, the only possible change is OLDSIG -> SIGCHLD. This can race with eligible_child() but the race is harmless. We can race with reparent_leader() which changes our ->exit_signal in parallel, but it does the same change to SIGCHLD. The noticeable user-visible change is that the execing task is not "visible" to do_wait()->eligible_child(__WCLONE) right after exec. To me this looks more logical, and this is consistent with mt case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-09genirq: Get rid of unnecessary IRQTF_DIED flagAlexander Gordeev
Currently IRQTF_DIED flag is set when a IRQ thread handler calls do_exit() But also PF_EXITING per process flag gets set when a thread exits. This fix eliminates the duplicate by using PF_EXITING flag. Also, there is a race condition in exit_irq_thread(). In case a thread's bit is cleared in desc->threads_oneshot (and the IRQ line gets unmasked), but before IRQTF_DIED flag is set, a new interrupt might come in and set just cleared bit again, this time forever. This fix throws IRQTF_DIED flag away, eliminating the race as a result. [ tglx: Test THREAD_EXITING first as suggested by Oleg ] Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Alexander Gordeev <agordeev@redhat.com> Link: http://lkml.kernel.org/r/20120309135958.GD2114@dhcp-26-207.brq.redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2012-03-04Merge branch 'pm-sleep'Rafael J. Wysocki
* pm-sleep: PM / Freezer: Remove references to TIF_FREEZE in comments PM / Sleep: Add more wakeup source initialization routines PM / Hibernate: Enable usermodehelpers in hibernate() error path PM / Sleep: Make __pm_stay_awake() delete wakeup source timers PM / Sleep: Fix race conditions related to wakeup source timer function PM / Sleep: Fix possible infinite loop during wakeup source destruction PM / Hibernate: print physical addresses consistently with other parts of kernel PM: Add comment describing relationships between PM callbacks to pm.h PM / Sleep: Drop suspend_stats_update() PM / Sleep: Make enter_state() in kernel/power/suspend.c static PM / Sleep: Unify kerneldoc comments in kernel/power/suspend.c PM / Sleep: Remove unnecessary label from suspend_freeze_processes() PM / Sleep: Do not check wakeup too often in try_to_freeze_tasks() PM / Sleep: Initialize wakeup source locks in wakeup_source_add() PM / Hibernate: Refactor and simplify freezer_test_done PM / Hibernate: Thaw kernel threads in hibernation_snapshot() in error/test path PM / Freezer / Docs: Document the beauty of freeze/thaw semantics PM / Suspend: Avoid code duplication in suspend statistics update PM / Sleep: Introduce generic callbacks for new device PM phases PM / Sleep: Introduce "late suspend" and "early resume" of devices
2012-03-04PM / Freezer: Remove references to TIF_FREEZE in commentsMarcos Paulo de Souza
This patch removes all the references in the code about the TIF_FREEZE flag removed by commit a3201227f803ad7fd43180c5195dbe5a2bf998aa freezer: make freezing() test freeze conditions in effect instead of TIF_FREEZE There still are some references to TIF_FREEZE in Documentation/power/freezing-of-tasks.txt, but it looks like that documentation needs more thorough work to reflect how the new freezer works, and hence merely removing the references to TIF_FREEZE won't really help. So I have not touched that part in this patch. Suggested-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-02-19Replace the fd_sets in struct fdtable with an array of unsigned longsDavid Howells
Replace the fd_sets in struct fdtable with an array of unsigned longs and then use the standard non-atomic bit operations rather than the FD_* macros. This: (1) Removes the abuses of struct fd_set: (a) Since we don't want to allocate a full fd_set the vast majority of the time, we actually, in effect, just allocate a just-big-enough array of unsigned longs and cast it to an fd_set type - so why bother with the fd_set at all? (b) Some places outside of the core fdtable handling code (such as SELinux) want to look inside the array of unsigned longs hidden inside the fd_set struct for more efficient iteration over the entire set. (2) Eliminates the use of FD_*() macros in the kernel completely. (3) Permits the __FD_*() macros to be deleted entirely where not exposed to userspace. Signed-off-by: David Howells <dhowells@redhat.com> Link: http://lkml.kernel.org/r/20120216174954.23314.48147.stgit@warthog.procyon.org.uk Signed-off-by: H. Peter Anvin <hpa@zytor.com> Cc: Al Viro <viro@zeniv.linux.org.uk>
2012-02-14security: trim security.hAl Viro
Trim security.h Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: James Morris <jmorris@namei.org>
2012-01-27sched: Fix ancient race in do_exit()Yasunori Goto
try_to_wake_up() has a problem which may change status from TASK_DEAD to TASK_RUNNING in race condition with SMI or guest environment of virtual machine. As a result, exited task is scheduled() again and panic occurs. Here is the sequence how it occurs: ----------------------------------+----------------------------- | CPU A | CPU B ----------------------------------+----------------------------- TASK A calls exit().... do_exit() exit_mm() down_read(mm->mmap_sem); rwsem_down_failed_common() set TASK_UNINTERRUPTIBLE set waiter.task <= task A list_add to sem->wait_list : raw_spin_unlock_irq() (I/O interruption occured) __rwsem_do_wake(mmap_sem) list_del(&waiter->list); waiter->task = NULL wake_up_process(task A) try_to_wake_up() (task is still TASK_UNINTERRUPTIBLE) p->on_rq is still 1.) ttwu_do_wakeup() (*A) : (I/O interruption handler finished) if (!waiter.task) schedule() is not called due to waiter.task is NULL. tsk->state = TASK_RUNNING : check_preempt_curr(); : task->state = TASK_DEAD (*B) <--- set TASK_RUNNING (*C) schedule() (exit task is running again) BUG_ON() is called! -------------------------------------------------------- The execution time between (*A) and (*B) is usually very short, because the interruption is disabled, and setting TASK_RUNNING at (*C) must be executed before setting TASK_DEAD. HOWEVER, if SMI is interrupted between (*A) and (*B), (*C) is able to execute AFTER setting TASK_DEAD! Then, exited task is scheduled again, and BUG_ON() is called.... If the system works on guest system of virtual machine, the time between (*A) and (*B) may be also long due to scheduling of hypervisor, and same phenomenon can occur. By this patch, do_exit() waits for releasing task->pi_lock which is used in try_to_wake_up(). It guarantees the task becomes TASK_DEAD after waking up. Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/20120117174031.3118.E1E9C6FF@jp.fujitsu.com Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-01-17Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit: (29 commits) audit: no leading space in audit_log_d_path prefix audit: treat s_id as an untrusted string audit: fix signedness bug in audit_log_execve_info() audit: comparison on interprocess fields audit: implement all object interfield comparisons audit: allow interfield comparison between gid and ogid audit: complex interfield comparison helper audit: allow interfield comparison in audit rules Kernel: Audit Support For The ARM Platform audit: do not call audit_getname on error audit: only allow tasks to set their loginuid if it is -1 audit: remove task argument to audit_set_loginuid audit: allow audit matching on inode gid audit: allow matching on obj_uid audit: remove audit_finish_fork as it can't be called audit: reject entry,always rules audit: inline audit_free to simplify the look of generic code audit: drop audit_set_macxattr as it doesn't do anything audit: inline checks for not needing to collect aux records audit: drop some potentially inadvisable likely notations ... Use evil merge to fix up grammar mistakes in Kconfig file. Bad speling and horrible grammar (and copious swearing) is to be expected, but let's keep it to commit messages and comments, rather than expose it to users in config help texts or printouts.
2012-01-17audit: inline audit_free to simplify the look of generic codeEric Paris
make the conditional a static inline instead of doing it in generic code. Signed-off-by: Eric Paris <eparis@redhat.com>
2012-01-12treewide: remove useless NORET_TYPE macro and usesJoe Perches
It's a very old and now unused prototype marking so just delete it. Neaten panic pointer argument style to keep checkpatch quiet. Signed-off-by: Joe Perches <joe@perches.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Ralf Baechle <ralf@linux-mips.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>