summaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2009-01-06swapfile: swapon use discard (trim)Hugh Dickins
When adding swap, all the old data on swap can be forgotten: sys_swapon() discard all but the header page of the swap partition (or every extent but the header of the swap file), to give a solidstate swap device the opportunity to optimize its wear-levelling. If that succeeds, note SWP_DISCARDABLE for later use, and report it with a "D" at the right end of the kernel's "Adding ... swap" message. Perhaps something should be shown in /proc/swaps (swapon -s), but we have to be more cautious before making any addition to that format. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06swapfile: rearrange scan and swap_infoHugh Dickins
Before making functional changes, rearrange scan_swap_map() to simplify subsequent diffs. Actually, there is one functional change in there: leave cluster_nr negative while scanning for a new cluster - resetting it early increased the likelihood that when we have difficulty finding a free cluster, another task may come in and try doing exactly the same - just a waste of cpu. Before making functional changes, rearrange struct swap_info_struct slightly: flags will be needed as an unsigned long (for wait_on_bit), next is a good int to pair with prio, old_block_size is uninteresting so shift it to the end. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06swapfile: remove SWP_ACTIVE maskHugh Dickins
Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: make vread() and vwrite() declarationKOSAKI Motohiro
Sparse output following warnings. mm/vmalloc.c:1436:6: warning: symbol 'vread' was not declared. Should it be static? mm/vmalloc.c:1474:6: warning: symbol 'vwrite' was not declared. Should it be static? However, it is used by /dev/kmem. fixed here. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: optimize get_scan_ratio for no swapHugh Dickins
Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP. Yes, the gcc optimizer gives us that, when nr_swap_pages is #defined as 0L. Move usual declaration to swapfile.c: it never belonged in page_alloc.c. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: add add_to_swap stubHugh Dickins
If we add a failing stub for add_to_swap(), then we can remove the #ifdef CONFIG_SWAP from mm/vmscan.c. This was intended as a source cleanup, but looking more closely, it turns out that the !CONFIG_SWAP case was going to keep_locked for an anonymous page, whereas now it goes to the more suitable activate_locked, like the CONFIG_SWAP nr_swap_pages 0 case. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: remove gfp_mask from add_to_swapHugh Dickins
Remove gfp_mask argument from add_to_swap(): it's misleading because its only caller, shrink_page_list(), is not atomic at that point; and in due course (implementing discard) we'll sometimes want to allocate some memory with GFP_NOIO (as is used in swap_writepage) when allocating swap. No change to the gfp_mask passed down to add_to_swap_cache(): still use __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before): though it's not obvious if that's the best combination to ask for here. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: try_to_free_swap replaces remove_exclusive_swap_pageHugh Dickins
remove_exclusive_swap_page(): its problem is in living up to its name. It doesn't matter if someone else has a reference to the page (raised page_count); it doesn't matter if the page is mapped into userspace (raised page_mapcount - though that hints it may be worth keeping the swap): all that matters is that there be no more references to the swap (and no writeback in progress). swapoff (try_to_unuse) has been removing pages from swapcache for years, with no concern for page count or page mapcount, and we used to have a comment in lookup_swap_cache() recognizing that: if you go for a page of swapcache, you'll get the right page, but it could have been removed from swapcache by the time you get page lock. So, give up asking for exclusivity: get rid of remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and remove_exclusive_swap_page_count() which were spawned for the recent LRU work: replace them by the simpler try_to_free_swap() which just checks page_swapcount(). Similarly, remove the page_count limitation from free_swap_and_count(), but assume that it's worth holding on to the swap if page is mapped and swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()? It would be consistent, but I think we probably have enough for now. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: reuse_swap_page replaces can_share_swap_pageHugh Dickins
A good place to free up old swap is where do_wp_page(), or do_swap_page(), is about to redirty the page: the data on disk is then stale and won't be read again; and if we do decide to write the page out later, using the previous swap location makes an unnecessary disk seek very likely. So give can_share_swap_page() the side-effect of delete_from_swap_cache() when it safely can. And can_share_swap_page() was always a misleading name, the more so if it has a side-effect: rename it reuse_swap_page(). Irrelevant cleanup nearby: remove swap_token_default_timeout definition from swap.h: it's used nowhere. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: add dirty_background_bytes and dirty_bytes sysctlsDavid Rientjes
This change introduces two new sysctls to /proc/sys/vm: dirty_background_bytes and dirty_bytes. dirty_background_bytes is the counterpart to dirty_background_ratio and dirty_bytes is the counterpart to dirty_ratio. With growing memory capacities of individual machines, it's no longer sufficient to specify dirty thresholds as a percentage of the amount of dirtyable memory over the entire system. dirty_background_bytes and dirty_bytes specify quantities of memory, in bytes, that represent the dirty limits for the entire system. If either of these values is set, its value represents the amount of dirty memory that is needed to commence either background or direct writeback. When a `bytes' or `ratio' file is written, its counterpart becomes a function of the written value. For example, if dirty_bytes is written to be 8096, 8K of memory is required to commence direct writeback. dirty_ratio is then functionally equivalent to 8K / the amount of dirtyable memory: dirtyable_memory = free pages + mapped pages + file cache dirty_background_bytes = dirty_background_ratio * dirtyable_memory -or- dirty_background_ratio = dirty_background_bytes / dirtyable_memory AND dirty_bytes = dirty_ratio * dirtyable_memory -or- dirty_ratio = dirty_bytes / dirtyable_memory Only one of dirty_background_bytes and dirty_background_ratio may be specified at a time, and only one of dirty_bytes and dirty_ratio may be specified. When one sysctl is written, the other appears as 0 when read. The `bytes' files operate on a page size granularity since dirty limits are compared with ZVC values, which are in page units. Prior to this change, the minimum dirty_ratio was 5 as implemented by get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user written value between 0 and 100. This restriction is maintained, but dirty_bytes has a lower limit of only one page. Also prior to this change, the dirty_background_ratio could not equal or exceed dirty_ratio. This restriction is maintained in addition to restricting dirty_background_bytes. If either background threshold equals or exceeds that of the dirty threshold, it is implicitly set to half the dirty threshold. Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: change dirty limit type specifiers to unsigned longDavid Rientjes
The background dirty and dirty limits are better defined with type specifiers of unsigned long since negative writeback thresholds are not possible. These values, as returned by get_dirty_limits(), are normally compared with ZVC values to determine whether writeback shall commence or be throttled. Such page counts cannot be negative, so declaring the page limits as signed is unnecessary. Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: David Rientjes <rientjes@google.com> Cc: Andrea Righi <righi.andrea@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: make page_lock_anon_vma() staticHugh Dickins
page_lock_anon_vma() and page_unlock_anon_vma() were made available to show_page_path() in vmscan.c; but now that has been removed, make them static in rmap.c again, they're better kept private if possible. Signed-off-by: Hugh Dickins <hugh@veritas.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: add_active_or_unevictable into rmapHugh Dickins
lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always appear together. Save some symbol table space and some jumping around by removing lru_cache_add_active_or_unevictable(), folding its code into page_add_new_anon_rmap(): like how we add file pages to lru just after adding them to page cache. Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as originally intended. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: add Set,ClearPageSwapCache stubsHugh Dickins
If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: remove GFP_HIGHUSER_PAGECACHEHugh Dickins
GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making that harder to track down: remove it, and its out-of-work brothers GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE. Since we're making that improvement to hotremove_migrate_alloc(), I think we can now also remove one of the "o"s from its comment. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: remove cgroup_mm_owner_callbacksHugh Dickins
cgroup_mm_owner_callbacks() was brought in to support the memrlimit controller, but sneaked into mainline ahead of it. That controller has now been shelved, and the mm_owner_changed() args were inadequate for it anyway (they needed an mm pointer instead of a task pointer). Remove the dead code, and restore mm_update_next_owner() locking to how it was before: taking mmap_sem there does nothing for memcontrol.c, now the only user of mm->owner. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: cleanup: remove #ifdef CONFIG_MIGRATIONKOSAKI Motohiro
#ifdef in *.c file decrease source readability a bit. removing is better. This patch doesn't have any functional change. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: get rid of pagevec_release_nonlru()KOSAKI Motohiro
speculative page references patch (commit: e286781d5f2e9c846e012a39653a166e9d31777d) removed last pagevec_release_nonlru() caller. So this function can be removed now. This patch doesn't have any functional change. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: show node to memory section relationship with symlinks in sysfsGary Hade
Show node to memory section relationship with symlinks in sysfs Add /sys/devices/system/node/nodeX/memoryY symlinks for all the memory sections located on nodeX. For example: /sys/devices/system/node/node1/memory135 -> ../../memory/memory135 indicates that memory section 135 resides on node1. Also revises documentation to cover this change as well as updating Documentation/ABI/testing/sysfs-devices-memory to include descriptions of memory hotremove files 'phys_device', 'phys_index', and 'state' that were previously not described there. In addition to it always being a good policy to provide users with the maximum possible amount of physical location information for resources that can be hot-added and/or hot-removed, the following are some (but likely not all) of the user benefits provided by this change. Immediate: - Provides information needed to determine the specific node on which a defective DIMM is located. This will reduce system downtime when the node or defective DIMM is swapped out. - Prevents unintended onlining of a memory section that was previously offlined due to a defective DIMM. This could happen during node hot-add when the user or node hot-add assist script onlines _all_ offlined sections due to user or script inability to identify the specific memory sections located on the hot-added node. The consequences of reintroducing the defective memory could be ugly. - Provides information needed to vary the amount and distribution of memory on specific nodes for testing or debugging purposes. Future: - Will provide information needed to identify the memory sections that need to be offlined prior to physical removal of a specific node. Symlink creation during boot was tested on 2-node x86_64, 2-node ppc64, and 2-node ia64 systems. Symlink creation during physical memory hot-add tested on a 2-node x86_64 system. Signed-off-by: Gary Hade <garyhade@us.ibm.com> Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06oom: print triggering task's cpuset and mems allowedDavid Rientjes
When cpusets are enabled, it's necessary to print the triggering task's set of allowable nodes so the subsequently printed meminfo can be interpreted correctly. We also print the task's cpuset name for informational purposes. [rientjes@google.com: task lock current before dereferencing cpuset] Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: invoke oom-killer from page faultNick Piggin
Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer. With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time. Create a hook for pagefault oom path to call into instead. Only converted x86 and uml so far. [akpm@linux-foundation.org: make __out_of_memory() static] [akpm@linux-foundation.org: fix comment] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Jeff Dike <jdike@addtoit.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: report the MMU pagesize in /proc/pid/smapsMel Gorman
The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the kernel to back a VMA. This matches the size used by the MMU in the majority of cases. However, one counter-example occurs on PPC64 kernels whereby a kernel using 64K as a base pagesize may still use 4K pages for the MMU on older processor. To distinguish, this patch reports MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06mm: report the pagesize backing a VMA in /proc/pid/smapsMel Gorman
It is useful to verify a hugepage-aware application is using the expected pagesizes for its memory regions. This patch creates an entry called KernelPageSize in /proc/pid/smaps that is the size of page used by the kernel to back a VMA. The entry is not called PageSize as it is possible the MMU uses a different size. This extension should not break any sensible parser that skips lines containing unrecognised information. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Acked-by: "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dmLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-2.6-dm: dm snapshot: extend exception store functions dm snapshot: split out exception store implementations dm snapshot: rename struct exception_store dm snapshot: separate out exception store interface dm mpath: move trigger_event to system workqueue dm: add name and uuid to sysfs dm table: rework reference counting dm: support barriers on simple devices dm request: extend target interface dm request: add caches dm ioctl: allow dm_copy_name_and_uuid to return only one field dm log: ensure log bitmap fits on log device dm log: move region_size validation dm log: avoid reinitialising io_req on every operation dm: consolidate target deregistration error handling dm raid1: fix error count dm log: fix dm_io_client leak on error paths dm snapshot: change yield to msleep dm table: drop reference at unbind
2009-01-06dm: support barriers on simple devicesAndi Kleen
Implement barrier support for single device DM devices This patch implements barrier support in DM for the common case of dm linear just remapping a single underlying device. In this case we can safely pass the barrier through because there can be no reordering between devices. NB. Any DM device might cease to support barriers if it gets reconfigured so code must continue to allow for a possible -EOPNOTSUPP on every barrier bio submitted. - agk Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm request: extend target interfaceKiyoshi Ueda
This patch adds the following target interfaces for request-based dm. map_rq : for mapping a request rq_end_io : for finishing a request busy : for avoiding performance regression from bio-based dm. Target can tell dm core not to map requests now, and that may help requests in the block layer queue to be bigger by I/O merging. In bio-based dm, this behavior is done by device drivers managing the block layer queue. But in request-based dm, dm core has to do that since dm core manages the block layer queue. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-06dm: consolidate target deregistration error handlingMikulas Patocka
Change dm_unregister_target to return void and use BUG() for error reporting. dm_unregister_target can only fail because of programming bug in the target driver. It can't fail because of user's behavior or disk errors. This patch changes unregister_target to return void and use BUG if someone tries to unregister non-registered target or unregister target that is in use. This patch removes code duplication (testing of error codes in all dm targets) and reports bugs in just one place, in dm_unregister_target. In some target drivers, these return codes were ignored, which could lead to a situation where bugs could be missed. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-01-05Merge branch 'for-next' of git://git.o-hand.com/linux-mfdLinus Torvalds
* 'for-next' of git://git.o-hand.com/linux-mfd: (30 commits) mfd: Fix section mismatch in da903x mfd: move drivers/i2c/chips/menelaus.c to drivers/mfd mfd: move drivers/i2c/chips/tps65010.c to drivers/mfd mfd: dm355evm msp430 driver mfd: Add missing break from wm3850-core mfd: Add WM8351 support mfd: Support configurable numbers of DCDCs and ISINKs on WM8350 mfd: Handle missing WM8350 platform data mfd: Add WM8352 support mfd: Use irq_to_desc in twl4030 code power_supply: Add Dialog DA9030 battery charger driver mfd: Dialog DA9030 battery charger MFD driver mfd: Register WM8400 codec device mfd: Pass driver_data onto child devices mfd: Fix twl4030-core.c build error mfd: twl4030 regulator bug fixes mfd: twl4030: create some regulator devices mfd: twl4030: cleanup symbols and OMAP dependency mfd: twl4030: simplified child creation code power_supply: Add battery health reporting for WM8350 ...
2009-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linusLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-for-linus: module: convert to stop_machine_create/destroy. stop_machine: introduce stop_machine_create/destroy. parisc: fix module loading failure of large kernel modules module: fix module loading failure of large kernel modules for parisc module: fix warning of unused function when !CONFIG_PROC_FS kernel/module.c: compare symbol values when marking symbols as exported in /proc/kallsyms. remove CONFIG_KMOD
2009-01-05Merge branch 'core-fixes-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: swiotlb: Don't include linux/swiotlb.h twice in lib/swiotlb.c intel-iommu: fix build error with INTR_REMAP=y and DMAR=n swiotlb: add missing __init annotations
2009-01-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (22 commits) HID: fix error condition propagation in hid-sony driver HID: fix reference count leak hidraw HID: add proper support for pensketch 12x9 tablet HID: don't allow DealExtreme usb-radio be handled by usb hid driver HID: fix default Kconfig setting for TopSpeed driver HID: driver for TopSeed Cyberlink quirky remote HID: make boot protocol drivers depend on EMBEDDED HID: avoid sparse warning in HID_COMPAT_LOAD_DRIVER HID: hiddev cleanup -- handle all error conditions properly HID: force feedback driver for GreenAsia 0x12 PID HID: switch specialized drivers from "default y" to !EMBEDDED HID: set proper dev.parent in hidraw HID: add dynids facility HID: use GFP_KERNEL in hid_alloc_buffers HID: usbhid, use usb_endpoint_xfer_int HID: move usbhid flags to usbhid.h HID: add n-trig digitizer support HID: add phys and name ioctls to hidraw HID: struct device - replace bus_id with dev_name(), dev_set_name() HID: automatically call usbhid_set_leds in usbhid driver ...
2009-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmwLinus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/gfs2-2.6-nmw: (27 commits) GFS2: Use DEFINE_SPINLOCK GFS2: Fix use-after-free bug on umount (try #2) Revert "GFS2: Fix use-after-free bug on umount" GFS2: Streamline alloc calculations for writes GFS2: Send useful information with uevent messages GFS2: Fix use-after-free bug on umount GFS2: Remove ancient, unused code GFS2: Move four functions from super.c GFS2: Fix bug in gfs2_lock_fs_check_clean() GFS2: Send some sensible sysfs stuff GFS2: Kill two daemons with one patch GFS2: Move gfs2_recoverd into recovery.c GFS2: Fix "truncate in progress" hang GFS2: Clean up & move gfs2_quotad GFS2: Add more detail to debugfs glock dumps GFS2: Banish struct gfs2_rgrpd_host GFS2: Move rg_free from gfs2_rgrpd_host to gfs2_rgrpd GFS2: Move rg_igeneration into struct gfs2_rgrpd GFS2: Banish struct gfs2_dinode_host GFS2: Move i_size from gfs2_dinode_host and rename it to i_disksize ...
2009-01-05Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (44 commits) qlge: Fix sparse warnings for tx ring indexes. qlge: Fix sparse warning regarding rx buffer queues. qlge: Fix sparse endian warning in ql_hw_csum_setup(). qlge: Fix sparse endian warning for inbound packet control block flags. qlge: Fix sparse warnings for byte swapping in qlge_ethool.c myri10ge: print MAC and serial number on probe failure pkt_sched: cls_u32: Fix locking in u32_change() iucv: fix cpu hotplug af_iucv: Free iucv path/socket in path_pending callback af_iucv: avoid left over IUCV connections from failing connects af_iucv: New error return codes for connect() net/ehea: bitops work on unsigned longs Revert "net: Fix for initial link state in 2.6.28" tcp: Kill extraneous SPLICE_F_NONBLOCK checks. tcp: don't mask EOF and socket errors on nonblocking splice receive dccp: Integrate the TFRC library with DCCP dccp: Clean up ccid.c after integration of CCID plugins dccp: Lockless integration of CCID congestion-control plugins qeth: get rid of extra argument after printk to dev_* conversion qeth: No large send using EDDP for HiperSockets. ...
2009-01-05Merge branch 'next' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/davej/cpufreq: [CPUFREQ] Fix on resume, now preserves user policy min/max. [CPUFREQ] Add Celeron Core support to p4-clockmod. [CPUFREQ] add to speedstep-lib additional fsb values for core processors [CPUFREQ] Disable sysfs ui for p4-clockmod. [CPUFREQ] p4-clockmod: reduce noise [CPUFREQ] clean up speedstep-centrino and reduce cpumask_t usage
2009-01-05Merge branch 'upstream-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 * 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2: (138 commits) ocfs2: Access the right buffer_head in ocfs2_merge_rec_left. ocfs2: use min_t in ocfs2_quota_read() ocfs2: remove unneeded lvb casts ocfs2: Add xattr support checking in init_security ocfs2: alloc xattr bucket in ocfs2_xattr_set_handle ocfs2: calculate and reserve credits for xattr value in mknod ocfs2/xattr: fix credits calculation during index create ocfs2/xattr: Always updating ctime during xattr set. ocfs2/xattr: Remove extend_trans call and add its credits from the beginning ocfs2/dlm: Fix race during lockres mastery ocfs2/dlm: Fix race in adding/removing lockres' to/from the tracking list ocfs2/dlm: Hold off sending lockres drop ref message while lockres is migrating ocfs2/dlm: Clean up errors in dlm_proxy_ast_handler() ocfs2/dlm: Fix a race between migrate request and exit domain ocfs2: One more hamming code optimization. ocfs2: Another hamming code optimization. ocfs2: Don't hand-code xor in ocfs2_hamming_encode(). ocfs2: Enable metadata checksums. ocfs2: Validate superblock with checksum and ecc. ocfs2: Checksum and ECC for directory blocks. ...
2009-01-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: inotify: fix type errors in interfaces fix breakage in reiserfs_new_inode() fix the treatment of jfs special inodes vfs: remove duplicate code in get_fs_type() add a vfs_fsync helper sys_execve and sys_uselib do not call into fsnotify zero i_uid/i_gid on inode allocation inode->i_op is never NULL ntfs: don't NULL i_op isofs check for NULL ->i_op in root directory is dead code affs: do not zero ->i_op kill suid bit only for regular files vfs: lseek(fd, 0, SEEK_CUR) race condition
2009-01-05mm lockless pagecache barrier fixNick Piggin
An XFS workload showed up a bug in the lockless pagecache patch. Basically it would go into an "infinite" loop, although it would sometimes be able to break out of the loop! The reason is a missing compiler barrier in the "increment reference count unless it was zero" case of the lockless pagecache protocol in the gang lookup functions. This would cause the compiler to use a cached value of struct page pointer to retry the operation with, rather than reload it. So the page might have been removed from pagecache and freed (refcount==0) but the lookup would not correctly notice the page is no longer in pagecache, and keep attempting to increment the refcount and failing, until the page gets reallocated for something else. This isn't a data corruption because the condition will be detected if the page has been reallocated. However it can result in a lockup. Linus points out that ACCESS_ONCE is also required in that pointer load, even if it's absence is not causing a bug on our particular build. The most general way to solve this is just to put an rcu_dereference in radix_tree_deref_slot. Assembly of find_get_pages, before: .L220: movq (%rbx), %rax #* ivtmp.1162, tmp82 movq (%rax), %rdi #, prephitmp.1149 .L218: testb $1, %dil #, prephitmp.1149 jne .L217 #, testq %rdi, %rdi # prephitmp.1149 je .L203 #, cmpq $-1, %rdi #, prephitmp.1149 je .L217 #, movl 8(%rdi), %esi # <variable>._count.counter, c testl %esi, %esi # c je .L218 #, after: .L212: movq (%rbx), %rax #* ivtmp.1109, tmp81 movq (%rax), %rdi #, ret testb $1, %dil #, ret jne .L211 #, testq %rdi, %rdi # ret je .L197 #, cmpq $-1, %rdi #, ret je .L211 #, movl 8(%rdi), %esi # <variable>._count.counter, c testl %esi, %esi # c je .L212 #, (notice the obvious infinite loop in the first example, if page->count remains 0) Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-05inotify: fix type errors in interfacesMichael Kerrisk
The problems lie in the types used for some inotify interfaces, both at the kernel level and at the glibc level. This mail addresses the kernel problem. I will follow up with some suggestions for glibc changes. For the sys_inotify_rm_watch() interface, the type of the 'wd' argument is currently 'u32', it should be '__s32' . That is Robert's suggestion, and is consistent with the other declarations of watch descriptors in the kernel source, in particular, the inotify_event structure in include/linux/inotify.h: struct inotify_event { __s32 wd; /* watch descriptor */ __u32 mask; /* watch mask */ __u32 cookie; /* cookie to synchronize two events */ __u32 len; /* length (including nulls) of name */ char name[0]; /* stub for possible name */ }; The patch makes the changes needed for inotify_rm_watch(). Signed-off-by: Michael Kerrisk <mtk.manpages@googlemail.com> Cc: Robert Love <rlove@google.com> Cc: Vegard Nossum <vegard.nossum@gmail.com> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-01-05add a vfs_fsync helperChristoph Hellwig
Fsync currently has a fdatawrite/fdatawait pair around the method call, and a mutex_lock/unlock of the inode mutex. All callers of fsync have to duplicate this, but we have a few and most of them don't quite get it right. This patch adds a new vfs_fsync that takes care of this. It's a little more complicated as usual as ->fsync might get a NULL file pointer and just a dentry from nfsd, but otherwise gets afile and we want to take the mapping and file operations from it when it is there. Notes on the fsync callers: - ecryptfs wasn't calling filemap_fdatawrite / filemap_fdatawait on the lower file - coda wasn't calling filemap_fdatawrite / filemap_fdatawait on the host file, and returning 0 when ->fsync was missing - shm wasn't calling either filemap_fdatawrite / filemap_fdatawait nor taking i_mutex. Now given that shared memory doesn't have disk backing not doing anything in fsync seems fine and I left it out of the vfs_fsync conversion for now, but in that case we might just not pass it through to the lower file at all but just call the no-op simple_sync_file directly. [and now actually export vfs_fsync] Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2009-01-05jbd2: Add buffer triggersJoel Becker
Filesystems often to do compute intensive operation on some metadata. If this operation is repeated many times, it can be very expensive. It would be much nicer if the operation could be performed once before a buffer goes to disk. This adds triggers to jbd2 buffer heads. Just before writing a metadata buffer to the journal, jbd2 will optionally call a commit trigger associated with the buffer. If the journal is aborted, an abort trigger will be called on any dirty buffers as they are dropped from pending transactions. ocfs2 will use this feature. Initially I tried to come up with a more generic trigger that could be used for non-buffer-related events like transaction completion. It doesn't tie nicely, because the information a buffer trigger needs (specific to a journal_head) isn't the same as what a transaction trigger needs (specific to a tranaction_t or perhaps journal_t). So I implemented a buffer set, with the understanding that journal/transaction wide triggers should be implemented separately. There is only one trigger set allowed per buffer. I can't think of any reason to attach more than one set. Contrast this with a journal or transaction in which multiple places may want to watch the entire transaction separately. The trigger sets are considered static allocation from the jbd2 perspective. ocfs2 will just have one trigger set per block type, setting the same set on every bh of the same type. Signed-off-by: Joel Becker <joel.becker@oracle.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: <linux-ext4@vger.kernel.org> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Export dquot_alloc() and dquot_destroy() functionsJan Kara
These are default functions for creating and destroying quota structures and they should be used from filesystems. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Unexport dqblk_v1.h and dqblk_v2.hJan Kara
Unexport header files dqblk_v[12].h since except for quota format ID they don't contain information userspace should be interested in. Move ID definitions to quota.h. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05jbd2: Add BH_JBDPrivateStartMark Fasheh
Add this so that file systems using JBD2 can safely allocate unused b_state bits. In this case, we add it so that Ocfs2 can define a single bit for tracking the validation state of a buffer. Acked-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Implement function for scanning active dquotsJan Kara
OCFS2 needs to scan all active dquots once in a while and sync quota information among cluster nodes. Provide a helper function for it so that it does not have to reimplement internally a list which VFS already has. Moreover this function is probably going to be useful for other clustered filesystems if they decide to use VFS quotas. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Add helpers to allow ocfs2 specific quota initialization, freeing and ↵Jan Kara
recovery OCFS2 needs to peek whether quota structure is already in memory so that it can avoid expensive cluster locking in that case. Similarly when freeing dquots, it checks whether it is the last quota structure user or not. Finally, it needs to get reference to dquot structure for specified id and quota type when recovering quota file after crash. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Update version numberJan Kara
Increase reported version number of quota support since quota core has changed significantly. Also remove __DQUOT_NUM_VERSION__ since nobody uses it. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Keep which entries were set by SETQUOTA quotactlJan Kara
Quota in a clustered environment needs to synchronize quota information among cluster nodes. This means we have to occasionally update some information in dquot from disk / network. On the other hand we have to be careful not to overwrite changes administrator did via SETQUOTA. So indicate in dquot->dq_flags which entries have been set by SETQUOTA and quota format can clear these flags when it properly propagated the changes. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Allow negative usage of space and inodesJan Kara
For clustered filesystems, it can happen that space / inode usage goes negative temporarily (because some node is allocating another node is freeing and they are not completely in sync). So let quota code allow this and change qsize_t so a signed type so that we don't underflow the variables. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Convert union in mem_dqinfo to a pointerJan Kara
Coming quota support for OCFS2 is going to need quite a bit of additional per-sb quota information. Moreover having fs.h include all the types needed for this structure would be a pain in the a**. So remove the union from mem_dqinfo and add a private pointer for filesystem's use. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>
2009-01-05quota: Split off quota tree handling into a separate fileJan Kara
There is going to be a new version of quota format having 64-bit quota limits and a new quota format for OCFS2. They are both going to use the same tree structure as VFSv0 quota format. So split out tree handling into a separate file and make size of leaf blocks, amount of space usable in each block (needed for checksumming) and structures contained in them configurable so that the code can be shared. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Mark Fasheh <mfasheh@suse.com>