aboutsummaryrefslogtreecommitdiff
path: root/ipc
AgeCommit message (Collapse)Author
2015-05-08ipc/mqueue: Implement lockless pipelined wakeupsDavidlohr Bueso
This patch moves the wakeup_process() invocation so it is not done under the info->lock by making use of a lockless wake_q. With this change, the waiter is woken up once it is STATE_READY and it does not need to loop on SMP if it is still in STATE_PENDING. In the timeout case we still need to grab the info->lock to verify the state. This change should also avoid the introduction of preempt_disable() in -rt which avoids a busy-loop which pools for the STATE_PENDING -> STATE_READY change if the waiter has a higher priority compared to the waker. Additionally, this patch micro-optimizes wq_sleep by using the cheaper cousin of set_current_state(TASK_INTERRUPTABLE) as we will block no matter what, thus get rid of the implied barrier. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: George Spelvin <linux@horizon.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Chris Mason <clm@fb.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: dave@stgolabs.net Link: http://lkml.kernel.org/r/1430748166.1940.17.camel@stgolabs.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-26Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull fourth vfs update from Al Viro: "d_inode() annotations from David Howells (sat in for-next since before the beginning of merge window) + four assorted fixes" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RCU pathwalk breakage when running into a symlink overmounting something fix I_DIO_WAKEUP definition direct-io: only inc/dec inode->i_dio_count for file systems fs/9p: fix readdir() VFS: assorted d_backing_inode() annotations VFS: fs/inode.c helpers: d_inode() annotations VFS: fs/cachefiles: d_backing_inode() annotations VFS: fs library helpers: d_inode() annotations VFS: assorted weird filesystems: d_inode() annotations VFS: normal filesystems (and lustre): d_inode() annotations VFS: security/: d_inode() annotations VFS: security/: d_backing_inode() annotations VFS: net/: d_inode() annotations VFS: net/unix: d_backing_inode() annotations VFS: kernel/: d_inode() annotations VFS: audit: d_backing_inode() annotations VFS: Fix up some ->d_inode accesses in the chelsio driver VFS: Cachefiles should perform fs modifications on the top layer only VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-15ipc: remove use of seq_printf return valueJoe Perches
The seq_printf return value, because it's frequently misused, will eventually be converted to void. See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to seq_has_overflowed() and make public") Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15VFS: assorted weird filesystems: d_inode() annotationsDavid Howells
Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-17ipc,sem: use current->state helpersDavidlohr Bueso
Call __set_current_state() instead of assigning the new state directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP environments, keeping track of who changed the state. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-16Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile #2 from Al Viro: "Next pile (and there'll be one or two more). The large piece in this one is getting rid of /proc/*/ns/* weirdness; among other things, it allows to (finally) make nameidata completely opaque outside of fs/namei.c, making for easier further cleanups in there" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: coda_venus_readdir(): use file_inode() fs/namei.c: fold link_path_walk() call into path_init() path_init(): don't bother with LOOKUP_PARENT in argument fs/namei.c: new helper (path_cleanup()) path_init(): store the "base" pointer to file in nameidata itself make default ->i_fop have ->open() fail with ENXIO make nameidata completely opaque outside of fs/namei.c kill proc_ns completely take the targets of /proc/*/ns/* symlinks to separate fs bury struct proc_ns in fs/proc copy address of proc_ns_ops into ns_common new helpers: ns_alloc_inum/ns_free_inum make proc_ns_operations work with struct ns_common * instead of void * switch the rest of proc_ns_operations to working with &...->ns netns: switch ->get()/->put()/->install()/->inum() to working with &net->ns make mntns ->get()/->put()/->install()/->inum() work with &mnt_ns->ns common object embedded into various struct ....ns
2014-12-13shmdt: use i_size_read() instead of ->i_sizeDave Hansen
Andrew Morton noted http://lkml.kernel.org/r/20141104142027.a7a0d010772d84560b445f59@linux-foundation.org that the shmdt uses inode->i_size outside of i_mutex being held. There is one more case in shm.c in shm_destroy(). This converts both users over to use i_size_read(). Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13ipc/shm.c: fix overly aggressive shmdt() when calls span multiple segmentsDave Hansen
This is a highly-contrived scenario. But, a single shmdt() call can be induced in to unmapping memory from mulitple shm segments. Example code is here: http://www.sr71.net/~dave/intel/shmfun.c The fix is pretty simple: Record the 'struct file' for the first VMA we encounter and then stick to it. Decline to unmap anything not from the same file and thus the same segment. I found this by inspection and the odds of anyone hitting this in practice are pretty darn small. Lightly tested, but it's a pretty small patch. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13ipc/msg: increase MSGMNI, remove scalingManfred Spraul
SysV can be abused to allocate locked kernel memory. For most systems, a small limit doesn't make sense, see the discussion with regards to SHMMAX. Therefore: increase MSGMNI to the maximum supported. And: If we ignore the risk of locking too much memory, then an automatic scaling of MSGMNI doesn't make sense. Therefore the logic can be removed. The code preserves auto_msgmni to avoid breaking any user space applications that expect that the value exists. Notes: 1) If an administrator must limit the memory allocations, then he can set MSGMNI as necessary. Or he can disable sysv entirely (as e.g. done by Android). 2) MSGMAX and MSGMNB are intentionally not increased, as these values are used to control latency vs. throughput: If MSGMNB is large, then msgsnd() just returns and more messages can be queued before a task switch to a task that calls msgrcv() is forced. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-13ipc/sem.c: change memory barrier in sem_lock() to smp_rmb()Manfred Spraul
When I fixed bugs in the sem_lock() logic, I was more conservative than necessary. Therefore it is safe to replace the smp_mb() with smp_rmb(). And: With smp_rmb(), semop() syscalls are up to 10% faster. The race we must protect against is: sem->lock is free sma->complex_count = 0 sma->sem_perm.lock held by thread B thread A: A: spin_lock(&sem->lock) B: sma->complex_count++; (now 1) B: spin_unlock(&sma->sem_perm.lock); A: spin_is_locked(&sma->sem_perm.lock); A: XXXXX memory barrier A: if (sma->complex_count == 0) Thread A must read the increased complex_count value, i.e. the read must not be reordered with the read of sem_perm.lock done by spin_is_locked(). Since it's about ordering of reads, smp_rmb() is sufficient. [akpm@linux-foundation.org: update sem_lock() comment, from Davidlohr] Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10Merge branch 'nsfs' into for-nextAl Viro
2014-12-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull VFS changes from Al Viro: "First pile out of several (there _definitely_ will be more). Stuff in this one: - unification of d_splice_alias()/d_materialize_unique() - iov_iter rewrite - killing a bunch of ->f_path.dentry users (and f_dentry macro). Getting that completed will make life much simpler for unionmount/overlayfs, since then we'll be able to limit the places sensitive to file _dentry_ to reasonably few. Which allows to have file_inode(file) pointing to inode in a covered layer, with dentry pointing to (negative) dentry in union one. Still not complete, but much closer now. - crapectomy in lustre (dead code removal, mostly) - "let's make seq_printf return nothing" preparations - assorted cleanups and fixes There _definitely_ will be more piles" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) copy_from_iter_nocache() new helper: iov_iter_kvec() csum_and_copy_..._iter() iov_iter.c: handle ITER_KVEC directly iov_iter.c: convert copy_to_iter() to iterate_and_advance iov_iter.c: convert copy_from_iter() to iterate_and_advance iov_iter.c: get rid of bvec_copy_page_{to,from}_iter() iov_iter.c: convert iov_iter_zero() to iterate_and_advance iov_iter.c: convert iov_iter_get_pages_alloc() to iterate_all_kinds iov_iter.c: convert iov_iter_get_pages() to iterate_all_kinds iov_iter.c: convert iov_iter_npages() to iterate_all_kinds iov_iter.c: iterate_and_advance iov_iter.c: macros for iterating over iov_iter kill f_dentry macro dcache: fix kmemcheck warning in switch_names new helper: audit_file() nfsd_vfs_write(): use file_inode() ncpfs: use file_inode() kill f_dentry uses lockd: get rid of ->f_path.dentry->d_sb ...
2014-12-04copy address of proc_ns_ops into ns_commonAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04new helpers: ns_alloc_inum/ns_free_inumAl Viro
take struct ns_common *, for now simply wrappers around proc_{alloc,free}_inum() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04make proc_ns_operations work with struct ns_common * instead of void *Al Viro
We can do that now. And kill ->inum(), while we are at it - all instances are identical. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04switch the rest of proc_ns_operations to working with &...->nsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-04common object embedded into various struct ....nsAl Viro
for now - just move corresponding ->proc_inum instances over there Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-12-03ipc/sem.c: fully initialize sem_array before making it visibleManfred Spraul
ipc_addid() makes a new ipc identifier visible to everyone. New objects start as locked, so that the caller can complete the initialization after the call. Within struct sem_array, at least sma->sem_base and sma->sem_nsems are accessed without any locks, therefore this approach doesn't work. Thus: Move the ipc_addid() to the end of the initialization. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Rik van Riel <riel@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Rafael Aquini <aquini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-11-19new helper: audit_file()Al Viro
... for situations when we don't have any candidate in pathnames - basically, in descriptor-based syscalls. [Folded the build fix for !CONFIG_AUDITSYSCALL configs from Chen Gang] Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2014-10-14ipc: resolve shadow warningsMark Rustad
Resolve some shadow warnings produced in W=2 builds by changing the name of some parameters and local variables. Change instances of "s64" because that clashes with the well-known typedef. Also change a local variable with the name "up" because that clashes with the name of of the "up" function for semaphores. These are hazards so eliminate the hazards by renaming them. Signed-off-by: Mark Rustad <mark.d.rustad@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14ipc/util.c: use __seq_open_private() instead of seq_open()Rob Jones
Using __seq_open_private() removes boilerplate code from sysvipc_proc_open(). The resultant code is shorter and easier to follow. However, please note that __seq_open_private() call kzalloc() rather than kmalloc() which may affect timing due to the memory initialisation overhead. Signed-off-by: Rob Jones <rob.jones@codethink.co.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14ipc/shm: kill the historical/wrong mm->start_stack checkOleg Nesterov
do_shmat() is the only user of ->start_stack (proc just reports its value), and this check looks ugly and wrong. The reason for this check is not clear at all, and it wrongly assumes that the stack can only grow down. But the main problem is that in general mm->start_stack has nothing to do with stack_vma->vm_start. Not only the application can switch to another stack and even unmap this area, setup_arg_pages() expands the stack without updating mm->start_stack during exec(). This means that in the likely case "addr > start_stack - size - PAGE_SIZE * 5" is simply impossible after find_vma_intersection() == F, or the stack can't grow anyway because of RLIMIT_STACK. Many thanks to Hugh for his explanations. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-14ipc: always handle a new value of auto_msgmniAndrey Vagin
proc_dointvec_minmax() returns zero if a new value has been set. So we don't need to check all charecters have been handled. Below you can find two examples. In the new value has not been handled properly. $ strace ./a.out open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3 write(3, "0\n\0", 3) = 2 close(3) = 0 exit_group(0) $ cat /sys/kernel/debug/tracing/trace $strace ./a.out open("/proc/sys/kernel/auto_msgmni", O_WRONLY) = 3 write(3, "0\n", 2) = 2 close(3) = 0 $ cat /sys/kernel/debug/tracing/trace a.out-697 [000] .... 3280.998235: unregister_ipcns_notifier <-proc_ipcauto_dointvec_minmax Fixes: 9eefe520c814 ("ipc: do not use a negative value to re-enable msgmni automatic recomputin") Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Mathias Krause <minipli@googlemail.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Joe Perches <joe@perches.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-10-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull "trivial tree" updates from Jiri Kosina: "Usual pile from trivial tree everyone is so eagerly waiting for" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (39 commits) Remove MN10300_PROC_MN2WS0038 mei: fix comments treewide: Fix typos in Kconfig kprobes: update jprobe_example.c for do_fork() change Documentation: change "&" to "and" in Documentation/applying-patches.txt Documentation: remove obsolete pcmcia-cs from Changes Documentation: update links in Changes Documentation: Docbook: Fix generated DocBook/kernel-api.xml score: Remove GENERIC_HAS_IOMAP gpio: fix 'CONFIG_GPIO_IRQCHIP' comments tty: doc: Fix grammar in serial/tty dma-debug: modify check_for_stack output treewide: fix errors in printk genirq: fix reference in devm_request_threaded_irq comment treewide: fix synchronize_rcu() in comments checkstack.pl: port to AArch64 doc: queue-sysfs: minor fixes init/do_mounts: better syntax description MIPS: fix comment spelling powerpc/simpleboot: fix comment ...
2014-09-09Documentation: Docbook: Fix generated DocBook/kernel-api.xmlMasanari Iida
This patch fix spelling typo found in DocBook/kernel-api.xml. It is because the file is generated from the source comments, I have to fix the comments in source codes. Signed-off-by: Masanari Iida <standby24x7@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2014-08-09Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull namespace updates from Eric Biederman: "This is a bunch of small changes built against 3.16-rc6. The most significant change for users is the first patch which makes setns drmatically faster by removing unneded rcu handling. The next chunk of changes are so that "mount -o remount,.." will not allow the user namespace root to drop flags on a mount set by the system wide root. Aks this forces read-only mounts to stay read-only, no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec mounts to stay no exec and it prevents unprivileged users from messing with a mounts atime settings. I have included my test case as the last patch in this series so people performing backports can verify this change works correctly. The next change fixes a bug in NFS that was discovered while auditing nsproxy users for the first optimization. Today you can oops the kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever with pid namespaces. I rebased and fixed the build of the !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given that no one to my knowledge bases anything on my tree fixing the typo in place seems more responsible that requiring a typo-fix to be backported as well. The last change is a small semantic cleanup introducing /proc/thread-self and pointing /proc/mounts and /proc/net at it. This prevents several kinds of problemantic corner cases. It is a user-visible change so it has a minute chance of causing regressions so the change to /proc/mounts and /proc/net are individual one line commits that can be trivially reverted. Unfortunately I lost and could not find the email of the original reporter so he is not credited. From at least one perspective this change to /proc/net is a refgression fix to allow pthread /proc/net uses that were broken by the introduction of the network namespace" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts proc: Point /proc/net at /proc/thread-self/net instead of /proc/self/net proc: Implement /proc/thread-self to point at the directory of the current thread proc: Have net show up under /proc/<tgid>/task/<tid> NFS: Fix /proc/fs/nfsfs/servers and /proc/fs/nfsfs/volumes mnt: Add tests for unprivileged remount cases that have found to be faulty mnt: Change the default remount atime from relatime to the existing value mnt: Correct permission checks in do_remount mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount mnt: Only change user settable mount flags in remount namespaces: Use task_lock and not rcu to protect nsproxy
2014-08-08shm: allow exit_shm in parallel if only marking orphansJack Miller
If shm_rmid_force (the default state) is not set then the shmids are only marked as orphaned and does not require any add, delete, or locking of the tree structure. Seperate the sysctl on and off case, and only obtain the read lock. The newly added list head can be deleted under the read lock because we are only called with current and will only change the semids allocated by this task and not manipulate the list. This commit assumes that up_read includes a sufficient memory barrier for the writes to be seen my others that later obtain a write lock. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Jack Miller <millerjo@us.ibm.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08shm: make exit_shm work proportional to task activityJack Miller
This is small set of patches our team has had kicking around for a few versions internally that fixes tasks getting hung on shm_exit when there are many threads hammering it at once. Anton wrote a simple test to cause the issue: http://ozlabs.org/~anton/junkcode/bust_shm_exit.c Before applying this patchset, this test code will cause either hanging tracebacks or pthread out of memory errors. After this patchset, it will still produce output like: root@somehost:~# ./bust_shm_exit 1024 160 ... INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113) INFO: Stall ended before state dump start ... But the task will continue to run along happily, so we consider this an improvement over hanging, even if it's a bit noisy. This patch (of 3): exit_shm obtains the ipc_ns shm rwsem for write and holds it while it walks every shared memory segment in the namespace. Thus the amount of work is related to the number of shm segments in the namespace not the number of segments that might need to be cleaned. In addition, this occurs after the task has been notified the thread has exited, so the number of tasks waiting for the ns shm rwsem can grow without bound until memory is exausted. Add a list to the task struct of all shmids allocated by this task. Init the list head in copy_process. Use the ns->rwsem for locking. Add segments after id is added, remove before removing from id. On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar to handling of semaphore undo. I chose a define for the init sequence since its a simple list init, otherwise it would require a function call to avoid include loops between the semaphore code and the task struct. Converting the list_del to list_del_init for the unshare cases would remove the exit followed by init, but I left it blow up if not inited. Signed-off-by: Milton Miller <miltonm@bga.com> Signed-off-by: Jack Miller <millerjo@us.ibm.com> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Anton Blanchard <anton@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-07-29namespaces: Use task_lock and not rcu to protect nsproxyEric W. Biederman
The synchronous syncrhonize_rcu in switch_task_namespaces makes setns a sufficiently expensive system call that people have complained. Upon inspect nsproxy no longer needs rcu protection for remote reads. remote reads are rare. So optimize for same process reads and write by switching using rask_lock instead. This yields a simpler to understand lock, and a faster setns system call. In particular this fixes a performance regression observed by Rafael David Tinoco <rafael.tinoco@canonical.com>. This is effectively a revert of Pavel Emelyanov's commit cf7b708c8d1d7a27736771bcf4c457b332b0f818 Make access to task's nsproxy lighter from 2007. The race this originialy fixed no longer exists as do_notify_parent uses task_active_pid_ns(parent) instead of parent->nsproxy. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2014-06-06ipc: convert use of typedef ctl_table to struct ctl_tableJoe Perches
This typedef is unnecessary and should just be removed. Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: add a printk_once for semctl(GETNCNT/GETZCNT)Manfred Spraul
The actual Linux implementation for semctl(GETNCNT) and semctl(GETZCNT) always (since 0.99.10) reported a thread as sleeping on all semaphores that are listed in the semop() call. The documented behavior (both in the Linux man page and in the Single Unix Specification) is that a task should be reported on exactly one semaphore: The semaphore that caused the thread to got to sleep. This patch adds a pr_info_once() that is triggered if a thread hits the relevant case. The code triggers slightly too often, otherwise it would be necessary to replicate the old code. As there are no known users of GETNCNT or GETZCNT, this is done to prevent unnecessary bloat. The task that triggered is reported with name (tsk->comm) and pid. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Joe Perches <joe@perches.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: make semctl(,,{GETNCNT,GETZCNT}) standard compliantManfred Spraul
SUSv4 clearly defines how semncnt and semzcnt must be calculated: A task waits on exactly one semaphore: The semaphore from the first operation in the sop array that cannot proceed. The Linux implementation never followed the standard, it tried to count all semaphores that might be the reason why a task sleeps. This patch fixes that. Note: a) The implementation assumes that GETNCNT and GETZCNT are rare operations, therefore the code counts them only on demand. (If they wouldn't be rare, then the non-compliance would have been found earlier) b) compared to the initial version of the patch, the BUG_ONs were removed and it was clarified that the new behavior conforms to SUS. Back-compatibility concerns: Manfred: : - there is no application in Fedora that uses GETNCNT or GETZCNT. : : - application that use only single-sop semop() are also safe, the : difference only affects complex apps. : : - portable application are also safe, the new behavior is standard : compliant. : : But that's it. The old behavior existed in Linux from 0.99.something : until now. Michael: : * These operations seem to be very little used. Grepping the public : source that is contained Fedora 20 source DVD, there appear to be no : uses. Of course, this says nothing about uses in private / : non-mainstream FOSS code, but it seems likely that the same pattern : is followed there. : : * The existing behavior is hard enough to understand that I suspect : that no one understood it well enough to rely on it anyway : (especially as that behavior contradicted both man page and POSIX). : : So, there's a chance of breakage, but I estimate that it's minute. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: store which operation blocks in perform_atomic_semop()Manfred Spraul
Preparation for the next patch: In the slow-path of perform_atomic_semop(), store a pointer to the operation that caused the operation to block. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: change perform_atomic_semop parametersManfred Spraul
Right now, perform_atomic_semop gets the content of sem_queue as individual fields. Changes that, instead pass a pointer to sem_queue. This is a preparation for the next patch: it uses sem_queue to store the reason why a task must sleep. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: remove code duplicationManfred Spraul
count_semzcnt and count_semncnt are more of less identical. The patch creates a single function that either counts the number of tasks waiting for zero or waiting due to a decrease operation. Compared to the initial version, the BUG_ONs were removed. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/sem.c: bugfix for semctl(,,GETZCNT)Manfred Spraul
GETZCNT is supposed to return the number of threads that wait until a semaphore value becomes 0. The current implementation overlooks complex operations that contain both wait-for-zero operation and operations that alter at least one semaphore. The patch fixes that. It's intentionally copy&paste, this will be cleaned up in the next patch. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <davidlohr.bueso@hp.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc,msg: document volatile r_msgDavidlohr Bueso
The need for volatile is not obvious, document it. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc,msg: move some msgq ns code aroundDavidlohr Bueso
Nothing big and no logical changes, just get rid of some redundant function declarations. Move msg_[init/exit]_ns down the end of the file. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc,msg: use current->state helpersDavidlohr Bueso
Call __set_current_state() instead of assigning the new state directly. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Signed-off-by: Manfred Spraul <manfred@colorfullif.com> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/shm.c: check for integer overflow during shmget.Manfred Spraul
SHMMAX is the upper limit for the size of a shared memory segment, counted in bytes. The actual allocation is that size, rounded up to the next full page. Add a check that prevents the creation of segments where the rounded up size causes an integer overflow. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/shm.c: check for overflows of shm_totManfred Spraul
shm_tot counts the total number of pages used by shm segments. If SHMALL is ULONG_MAX (or nearly ULONG_MAX), then the number can overflow. Subsequent calls to shmctl(,SHM_INFO,) would return wrong values for shm_tot. The patch adds a detection for overflows. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc/shm.c: check for ulong overflows in shmatManfred Spraul
The increase of SHMMAX/SHMALL is a 4 patch series. The change itself is trivial, the only problem are interger overflows. The overflows are not new, but if we make huge values the default, then the code should be free from overflows. SHMMAX: - shmmem_file_setup places a hard limit on the segment size: MAX_LFS_FILESIZE. On 32-bit, the limit is > 1 TB, i.e. 4 GB-1 byte segments are possible. Rounded up to full pages the actual allocated size is 0. --> must be fixed, patch 3 - shmat: - find_vma_intersection does not handle overflows properly. --> must be fixed, patch 1 - the rest is fine, do_mmap_pgoff limits mappings to TASK_SIZE and checks for overflows (i.e.: map 2 GB, starting from addr=2.5GB fails). SHMALL: - after creating 8192 segments size (1L<<63)-1, shm_tot overflows and returns 0. --> must be fixed, patch 2. Userspace: - Obviously, there could be overflows in userspace. There is nothing we can do, only use values smaller than ULONG_MAX. I ended with "ULONG_MAX - 1L<<24": - TASK_SIZE cannot be used because it is the size of the current task. Could be 4G if it's a 32-bit task on a 64-bit kernel. - The maximum size is not standardized across archs: I found TASK_MAX_SIZE, TASK_SIZE_MAX and TASK_SIZE_64. - Just in case some arch revives a 4G/4G split, nearly ULONG_MAX is a valid segment size. - Using "0" as a magic value for infinity is even worse, because right now 0 means 0, i.e. fail all allocations. This patch (of 4): find_vma_intersection() does not work as intended if addr+size overflows. The patch adds a manual check before the call to find_vma_intersection. Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Acked-by: Davidlohr Bueso <davidlohr@hp.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc, kernel: clear whitespacePaul McQuade
trailing whitespace Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc, kernel: use Linux headersPaul McQuade
Use #include <linux/uaccess.h> instead of <asm/uaccess.h> Use #include <linux/types.h> instead of <asm/types.h> Signed-off-by: Paul McQuade <paulmcquad@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-06-06ipc: constify ipc_opsMathias Krause
There is no need to recreate the very same ipc_ops structure on every kernel entry for msgget/semget/shmget. Just declare it static and be done with it. While at it, constify it as we don't modify the structure at runtime. Found in the PaX patch, written by the PaX Team. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: PaX Team <pageexec@freemail.hu> Cc: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07ipc: use device_initcallDavidlohr Bueso
... since __initcall is now deprecated. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-07ipc/compat.c: remove sc_semopm macroDavidlohr Bueso
This macro appears to have been introduced back in the 2.5 era for semtimedop32 backward compatibility on ia32: https://lkml.org/lkml/2003/4/28/78 Nowadays, this syscall in compat just defaults back to the code found in sem.c, so it is no longer used and can thus be removed: long compat_sys_semtimedop(int semid, struct sembuf __user *tsems, unsigned nsops, const struct compat_timespec __user *timeout) { struct timespec __user *ts64; if (compat_convert_timespec(&ts64, timeout)) return -EFAULT; return sys_semtimedop(semid, tsems, nsops, ts64); } Furthermore, there are no users in compat.c. After this change, kernel builds just fine with both CONFIG_SYSVIPC_COMPAT and CONFIG_SYSVIPC. Signed-off-by: Davidlohr Bueso <davidlohr@hp.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-04-02Merge branch 'x86-x32-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull compat time conversion changes from Peter Anvin: "Despite the branch name this is really neither an x86 nor an x32-specific patchset, although it the implementation of the discussions that followed the x32 security hole a few months ago. This removes get/put_compat_timespec/val() and replaces them with compat_get/put_timespec/val() which are savvy as to the current status of COMPAT_USE_64BIT_TIME. It removes several unused and/or incorrect/misleading functions (like compat_put_timeval_convert which doesn't in fact do any conversion) and also replaces several open-coded implementations what is now called compat_convert_timespec() with that function" * 'x86-x32-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: compat: Fix sparse address space warnings compat: Get rid of (get|put)_compat_time(val|spec)
2014-03-31Merge branch 'compat' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 compat wrapper rework from Heiko Carstens: "S390 compat system call wrapper simplification work. The intention of this work is to get rid of all hand written assembly compat system call wrappers on s390, which perform proper sign or zero extension, or pointer conversion of compat system call parameters. Instead all of this should be done with C code eg by using Al's COMPAT_SYSCALL_DEFINEx() macro. Therefore all common code and s390 specific compat system calls have been converted to the COMPAT_SYSCALL_DEFINEx() macro. In order to generate correct code all compat system calls may only have eg compat_ulong_t parameters, but no unsigned long parameters. Those patches which change parameter types from unsigned long to compat_ulong_t parameters are separate in this series, but shouldn't cause any harm. The only compat system calls which intentionally have 64 bit parameters (preadv64 and pwritev64) in support of the x86/32 ABI haven't been changed, but are now only available if an architecture defines __ARCH_WANT_COMPAT_SYS_PREADV64/PWRITEV64. System calls which do not have a compat variant but still need proper zero extension on s390, like eg "long sys_brk(unsigned long brk)" will get a proper wrapper function with the new s390 specific COMPAT_SYSCALL_WRAPx() macro: COMPAT_SYSCALL_WRAP1(brk, unsigned long, brk); which generates the following code (simplified): asmlinkage long sys_brk(unsigned long brk); asmlinkage long compat_sys_brk(long brk) { return sys_brk((u32)brk); } Given that the C file which contains all the COMPAT_SYSCALL_WRAP lines includes both linux/syscall.h and linux/compat.h, it will generate build errors, if the declaration of sys_brk() doesn't match, or if there exists a non-matching compat_sys_brk() declaration. In addition this will intentionally result in a link error if somewhere else a compat_sys_brk() function exists, which probably should have been used instead. Two more BUILD_BUG_ONs make sure the size and type of each compat syscall parameter can be handled correctly with the s390 specific macros. I converted the compat system calls step by step to verify the generated code is correct and matches the previous code. In fact it did not always match, however that was always a bug in the hand written asm code. In result we get less code, less bugs, and much more sanity checking" * 'compat' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (44 commits) s390/compat: add copyright statement compat: include linux/unistd.h within linux/compat.h s390/compat: get rid of compat wrapper assembly code s390/compat: build error for large compat syscall args mm/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types kexec/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types net/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types ipc/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types fs/compat: convert to COMPAT_SYSCALL_DEFINE with changing parameter types ipc/compat: convert to COMPAT_SYSCALL_DEFINE fs/compat: convert to COMPAT_SYSCALL_DEFINE security/compat: convert to COMPAT_SYSCALL_DEFINE mm/compat: convert to COMPAT_SYSCALL_DEFINE net/compat: convert to COMPAT_SYSCALL_DEFINE kernel/compat: convert to COMPAT_SYSCALL_DEFINE fs/compat: optional preadv64/pwrite64 compat system calls ipc/compat_sys_msgrcv: change msgtyp type from long to compat_long_t s390/compat: partial parameter conversion within syscall wrappers s390/compat: automatic zero, sign and pointer conversion of syscalls s390/compat: add sync_file_range and fallocate compat syscalls ...
2014-03-16ipc: Fix 2 bugs in msgrcv() MSG_COPY implementationMichael Kerrisk
While testing and documenting the msgrcv() MSG_COPY flag that Stanislav Kinsbursky added in commit 4a674f34ba04 ("ipc: introduce message queue copy feature" => kernel 3.8), I discovered a couple of bugs in the implementation. The two bugs concern MSG_COPY interactions with other msgrcv() flags, namely: (A) MSG_COPY + MSG_EXCEPT (B) MSG_COPY + !IPC_NOWAIT The bugs are distinct (and the fix for the first one is obvious), however my fix for both is a single-line patch, which is why I'm combining them in a single mail, rather than writing two mails+patches. ===== (A) MSG_COPY + MSG_EXCEPT ===== With the addition of the MSG_COPY flag, there are now two msgrcv() flags--MSG_COPY and MSG_EXCEPT--that modify the meaning of the 'msgtyp' argument in unrelated ways. Specifying both in the same call is a logical error that is currently permitted, with the effect that MSG_COPY has priority and MSG_EXCEPT is ignored. The call should give an error if both flags are specified. The patch below implements that behavior. ===== (B) (B) MSG_COPY + !IPC_NOWAIT ===== The test code that was submitted in commit 3a665531a3b7 ("selftests: IPC message queue copy feature test") shows MSG_COPY being used in conjunction with IPC_NOWAIT. In other words, if there is no message at the position 'msgtyp'. return immediately with the error in ENOMSG. What was not (fully) tested is the behavior if MSG_COPY is specified *without* IPC_NOWAIT, and there is an odd behavior. If the queue contains less than 'msgtyp' messages, then the call blocks until the next message is written to the queue. At that point, the msgrcv() call returns a copy of the newly added message, regardless of whether that message is at the ordinal position 'msgtyp'. This is clearly bogus, and problematic for applications that might want to make use of the MSG_COPY flag. I considered the following possible solutions to this problem: (1) Force the call to block until a message *does* appear at the position 'msgtyp'. (2) If the MSG_COPY flag is specified, the kernel should implicitly add IPC_NOWAIT, so that the call fails with ENOMSG for this case. (3) If the MSG_COPY flag is specified, but IPC_NOWAIT is not, generate an error (probably, EINVAL is the right one). I do not know if any application would really want to have the functionality of solution (1), especially since an application can determine in advance the number of messages in the queue using msgctl() IPC_STAT. Obviously, this solution would be the most work to implement. Solution (2) would have the effect of silently fixing any applications that tried to employ broken behavior. However, it would mean that if we later decided to implement solution (1), then user-space could not easily detect what the kernel supports (but, since I'm somewhat doubtful that solution (1) is needed, I'm not sure that this is much of a problem). Solution (3) would have the effect of informing broken applications that they are doing something broken. The downside is that this would cause a ABI breakage for any applications that are currently employing the broken behavior. However: a) Those applications are almost certainly not getting the results they expect. b) Possibly, those applications don't even exist, because MSG_COPY is currently hidden behind CONFIG_CHECKPOINT_RESTORE. The upside of solution (3) is that if we later decided to implement solution (1), user-space could determine what the kernel supports, via the error return. In my view, solution (3) is mildly preferable to solution (2), and solution (1) could still be done later if anyone really cares. The patch below implements solution (3). PS. For anyone out there still listening, it's the usual story: documenting an API (and the thinking about, and the testing of the API, that documentation entails) is the one of the single best ways of finding bugs in the API, as I've learned from a lot of experience. Best to do that documentation before releasing the API. Signed-off-by: Michael Kerrisk <mtk.manpages@gmail.com> Acked-by: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: Stanislav Kinsbursky <skinsbursky@parallels.com> Cc: stable@vger.kernel.org Cc: Serge Hallyn <serge.hallyn@canonical.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>