aboutsummaryrefslogtreecommitdiff
path: root/include
AgeCommit message (Collapse)Author
2015-03-26workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for ↵Tejun Heo
PREEMPT_NONE commit 8603e1b30027f943cc9c1eef2b291d42c3347af1 upstream. cancel[_delayed]_work_sync() are implemented using __cancel_work_timer() which grabs the PENDING bit using try_to_grab_pending() and then flushes the work item with PENDING set to prevent the on-going execution of the work item from requeueing itself. try_to_grab_pending() can always grab PENDING bit without blocking except when someone else is doing the above flushing during cancelation. In that case, try_to_grab_pending() returns -ENOENT. In this case, __cancel_work_timer() currently invokes flush_work(). The assumption is that the completion of the work item is what the other canceling task would be waiting for too and thus waiting for the same condition and retrying should allow forward progress without excessive busy looping Unfortunately, this doesn't work if preemption is disabled or the latter task has real time priority. Let's say task A just got woken up from flush_work() by the completion of the target work item. If, before task A starts executing, task B gets scheduled and invokes __cancel_work_timer() on the same work item, its try_to_grab_pending() will return -ENOENT as the work item is still being canceled by task A and flush_work() will also immediately return false as the work item is no longer executing. This puts task B in a busy loop possibly preventing task A from executing and clearing the canceling state on the work item leading to a hang. task A task B worker executing work __cancel_work_timer() try_to_grab_pending() set work CANCELING flush_work() block for work completion completion, wakes up A __cancel_work_timer() while (forever) { try_to_grab_pending() -ENOENT as work is being canceled flush_work() false as work is no longer executing } This patch removes the possible hang by updating __cancel_work_timer() to explicitly wait for clearing of CANCELING rather than invoking flush_work() after try_to_grab_pending() fails with -ENOENT. Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com v3: bit_waitqueue() can't be used for work items defined in vmalloc area. Switched to custom wake function which matches the target work item and exclusive wait and wakeup. v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if the target bit waitqueue has wait_bit_queue's on it. Use DEFINE_WAIT_BIT() and __wake_up_bit() instead. Reported by Tomeu Vizoso. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Rabin Vincent <rabin.vincent@axis.com> Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com> Tested-by: Jesper Nilsson <jesper.nilsson@axis.com> Tested-by: Rabin Vincent <rabin.vincent@axis.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18target: Fix PR_APTPL_BUF_LEN buffer size limitationNicholas Bellinger
commit f161d4b44d7cc1dc66b53365215227db356378b1 upstream. This patch addresses the original PR_APTPL_BUF_LEN = 8k limitiation for write-out of PR APTPL metadata that Martin has recently been running into. It changes core_scsi3_update_and_write_aptpl() to use vzalloc'ed memory instead of kzalloc, and increases the default hardcoded length to 256k. It also adds logic in core_scsi3_update_and_write_aptpl() to double the original length upon core_scsi3_update_aptpl_buf() failure, and retries until the vzalloc'ed buffer is large enough to accommodate the outgoing APTPL metadata. Reported-by: Martin Svec <martin.svec@zoner.cz> Signed-off-by: Nicholas Bellinger <nab@linux-iscsi.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-18mm: when stealing freepages, also take pages created by splitting buddy pageVlastimil Babka
commit 99592d598eca62bdbbf62b59941c189176dfc614 upstream. When studying page stealing, I noticed some weird looking decisions in try_to_steal_freepages(). The first I assume is a bug (Patch 1), the following two patches were driven by evaluation. Testing was done with stress-highalloc of mmtests, using the mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how often page stealing occurs for individual migratetypes, and what migratetypes are used for fallbacks. Arguably, the worst case of page stealing is when UNMOVABLE allocation steals from MOVABLE pageblock. RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal, so the goal is to minimize these two cases. The evaluation of v2 wasn't always clear win and Joonsoo questioned the results. Here I used different baseline which includes RFC compaction improvements from [1]. I found that the compaction improvements reduce variability of stress-highalloc, so there's less noise in the data. First, let's look at stress-highalloc configured to do sync compaction, and how these patches reduce page stealing events during the test. First column is after fresh reboot, other two are reiterations of test without reboot. That was all accumulater over 5 re-iterations (so the benchmark was run 5x3 times with 5 fresh restarts). Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Page alloc extfrag event 10264225 8702233 10244125 Extfrag fragmenting 10263271 8701552 10243473 Extfrag fragmenting for unmovable 13595 17616 15960 Extfrag fragmenting unmovable placed with movable 7989 12193 8447 Extfrag fragmenting for reclaimable 658 1840 1817 Extfrag fragmenting reclaimable placed with movable 558 1677 1679 Extfrag fragmenting for movable 10249018 8682096 10225696 With Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Page alloc extfrag event 11834954 9877523 9774860 Extfrag fragmenting 11833993 9876880 9774245 Extfrag fragmenting for unmovable 7342 16129 11712 Extfrag fragmenting unmovable placed with movable 4191 10547 6270 Extfrag fragmenting for reclaimable 373 1130 923 Extfrag fragmenting reclaimable placed with movable 302 906 738 Extfrag fragmenting for movable 11826278 9859621 9761610 With Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Page alloc extfrag event 4725990 3668793 3807436 Extfrag fragmenting 4725104 3668252 3806898 Extfrag fragmenting for unmovable 6678 7974 7281 Extfrag fragmenting unmovable placed with movable 2051 3829 4017 Extfrag fragmenting for reclaimable 429 1208 1278 Extfrag fragmenting reclaimable placed with movable 369 976 1034 Extfrag fragmenting for movable 4717997 3659070 3798339 With Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Page alloc extfrag event 5016183 4700142 3850633 Extfrag fragmenting 5015325 4699613 3850072 Extfrag fragmenting for unmovable 1312 3154 3088 Extfrag fragmenting unmovable placed with movable 1115 2777 2714 Extfrag fragmenting for reclaimable 437 1193 1097 Extfrag fragmenting reclaimable placed with movable 330 969 879 Extfrag fragmenting for movable 5013576 4695266 3845887 In v2 we've seen apparent regression with Patch 1 for unmovable events, this is now gone, suggesting it was indeed noise. Here, each patch improves the situation for unmovable events. Reclaimable is improved by patch 1 and then either the same modulo noise, or perhaps sligtly worse - a small price for unmovable improvements, IMHO. The number of movable allocations falling back to other migratetypes is most noisy, but it's reduced to half at Patch 2 nevertheless. These are least critical as compaction can move them around. If we look at success rates, the patches don't affect them, that didn't change. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-nothp-1 5-nothp-2 5-nothp-3 Success 1 Min 49.00 ( 0.00%) 42.00 ( 14.29%) 41.00 ( 16.33%) Success 1 Mean 51.00 ( 0.00%) 45.00 ( 11.76%) 42.60 ( 16.47%) Success 1 Max 55.00 ( 0.00%) 51.00 ( 7.27%) 46.00 ( 16.36%) Success 2 Min 53.00 ( 0.00%) 47.00 ( 11.32%) 44.00 ( 16.98%) Success 2 Mean 59.60 ( 0.00%) 50.80 ( 14.77%) 48.20 ( 19.13%) Success 2 Max 64.00 ( 0.00%) 56.00 ( 12.50%) 52.00 ( 18.75%) Success 3 Min 84.00 ( 0.00%) 82.00 ( 2.38%) 78.00 ( 7.14%) Success 3 Mean 85.60 ( 0.00%) 82.80 ( 3.27%) 79.40 ( 7.24%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-nothp-1 6-nothp-2 6-nothp-3 Success 1 Min 49.00 ( 0.00%) 44.00 ( 10.20%) 44.00 ( 10.20%) Success 1 Mean 51.80 ( 0.00%) 46.00 ( 11.20%) 45.80 ( 11.58%) Success 1 Max 54.00 ( 0.00%) 49.00 ( 9.26%) 49.00 ( 9.26%) Success 2 Min 58.00 ( 0.00%) 49.00 ( 15.52%) 48.00 ( 17.24%) Success 2 Mean 60.40 ( 0.00%) 51.80 ( 14.24%) 50.80 ( 15.89%) Success 2 Max 63.00 ( 0.00%) 54.00 ( 14.29%) 55.00 ( 12.70%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 81.60 ( 4.00%) 79.80 ( 6.12%) Success 3 Max 86.00 ( 0.00%) 82.00 ( 4.65%) 82.00 ( 4.65%) Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-nothp-1 7-nothp-2 7-nothp-3 Success 1 Min 50.00 ( 0.00%) 44.00 ( 12.00%) 39.00 ( 22.00%) Success 1 Mean 52.80 ( 0.00%) 45.60 ( 13.64%) 42.40 ( 19.70%) Success 1 Max 55.00 ( 0.00%) 46.00 ( 16.36%) 47.00 ( 14.55%) Success 2 Min 52.00 ( 0.00%) 48.00 ( 7.69%) 45.00 ( 13.46%) Success 2 Mean 53.40 ( 0.00%) 49.80 ( 6.74%) 48.80 ( 8.61%) Success 2 Max 57.00 ( 0.00%) 52.00 ( 8.77%) 52.00 ( 8.77%) Success 3 Min 84.00 ( 0.00%) 81.00 ( 3.57%) 79.00 ( 5.95%) Success 3 Mean 85.00 ( 0.00%) 82.40 ( 3.06%) 79.60 ( 6.35%) Success 3 Max 86.00 ( 0.00%) 83.00 ( 3.49%) 80.00 ( 6.98%) Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-nothp-1 8-nothp-2 8-nothp-3 Success 1 Min 46.00 ( 0.00%) 44.00 ( 4.35%) 42.00 ( 8.70%) Success 1 Mean 50.20 ( 0.00%) 45.60 ( 9.16%) 44.00 ( 12.35%) Success 1 Max 52.00 ( 0.00%) 47.00 ( 9.62%) 47.00 ( 9.62%) Success 2 Min 53.00 ( 0.00%) 49.00 ( 7.55%) 48.00 ( 9.43%) Success 2 Mean 55.80 ( 0.00%) 50.60 ( 9.32%) 49.00 ( 12.19%) Success 2 Max 59.00 ( 0.00%) 52.00 ( 11.86%) 51.00 ( 13.56%) Success 3 Min 84.00 ( 0.00%) 80.00 ( 4.76%) 79.00 ( 5.95%) Success 3 Mean 85.40 ( 0.00%) 81.60 ( 4.45%) 80.40 ( 5.85%) Success 3 Max 87.00 ( 0.00%) 83.00 ( 4.60%) 82.00 ( 5.75%) While there's no improvement here, I consider reduced fragmentation events to be worth on its own. Patch 2 also seems to reduce scanning for free pages, and migrations in compaction, suggesting it has somewhat less work to do: Patch 1: Compaction stalls 4153 3959 3978 Compaction success 1523 1441 1446 Compaction failures 2630 2517 2531 Page migrate success 4600827 4943120 5104348 Page migrate failure 19763 16656 17806 Compaction pages isolated 9597640 10305617 10653541 Compaction migrate scanned 77828948 86533283 87137064 Compaction free scanned 517758295 521312840 521462251 Compaction cost 5503 5932 6110 Patch 2: Compaction stalls 3800 3450 3518 Compaction success 1421 1316 1317 Compaction failures 2379 2134 2201 Page migrate success 4160421 4502708 4752148 Page migrate failure 19705 14340 14911 Compaction pages isolated 8731983 9382374 9910043 Compaction migrate scanned 98362797 96349194 98609686 Compaction free scanned 496512560 469502017 480442545 Compaction cost 5173 5526 5811 As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers of unmovable and reclaimable pageblocks. Configuring the benchmark to allocate like THP page fault (i.e. no sync compaction) gives much noisier results for iterations 2 and 3 after reboot. This is not so surprising given how [1] offers lower improvements in this scenario due to less restarts after deferred compaction which would change compaction pivot. Baseline: 3.19-rc4 3.19-rc4 3.19-rc4 5-thp-1 5-thp-2 5-thp-3 Page alloc extfrag event 8148965 6227815 6646741 Extfrag fragmenting 8147872 6227130 6646117 Extfrag fragmenting for unmovable 10324 12942 15975 Extfrag fragmenting unmovable placed with movable 5972 8495 10907 Extfrag fragmenting for reclaimable 601 1707 2210 Extfrag fragmenting reclaimable placed with movable 520 1570 2000 Extfrag fragmenting for movable 8136947 6212481 6627932 Patch 1: 3.19-rc4 3.19-rc4 3.19-rc4 6-thp-1 6-thp-2 6-thp-3 Page alloc extfrag event 8345457 7574471 7020419 Extfrag fragmenting 8343546 7573777 7019718 Extfrag fragmenting for unmovable 10256 18535 30716 Extfrag fragmenting unmovable placed with movable 6893 11726 22181 Extfrag fragmenting for reclaimable 465 1208 1023 Extfrag fragmenting reclaimable placed with movable 353 996 843 Extfrag fragmenting for movable 8332825 7554034 6987979 Patch 2: 3.19-rc4 3.19-rc4 3.19-rc4 7-thp-1 7-thp-2 7-thp-3 Page alloc extfrag event 3512847 3020756 2891625 Extfrag fragmenting 3511940 3020185 2891059 Extfrag fragmenting for unmovable 9017 6892 6191 Extfrag fragmenting unmovable placed with movable 1524 3053 2435 Extfrag fragmenting for reclaimable 445 1081 1160 Extfrag fragmenting reclaimable placed with movable 375 918 986 Extfrag fragmenting for movable 3502478 3012212 2883708 Patch 3: 3.19-rc4 3.19-rc4 3.19-rc4 8-thp-1 8-thp-2 8-thp-3 Page alloc extfrag event 3181699 3082881 2674164 Extfrag fragmenting 3180812 3082303 2673611 Extfrag fragmenting for unmovable 1201 4031 4040 Extfrag fragmenting unmovable placed with movable 974 3611 3645 Extfrag fragmenting for reclaimable 478 1165 1294 Extfrag fragmenting reclaimable placed with movable 387 985 1030 Extfrag fragmenting for movable 3179133 3077107 2668277 The improvements for first iteration are clear, the rest is much noisier and can appear like regression for Patch 1. Anyway, patch 2 rectifies it. Allocation success rates are again unaffected so there's no point in making this e-mail any longer. [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2 This patch (of 3): When __rmqueue_fallback() is called to allocate a page of order X, it will find a page of order Y >= X of a fallback migratetype, which is different from the desired migratetype. With the help of try_to_steal_freepages(), it may change the migratetype (to the desired one) also of: 1) all currently free pages in the pageblock containing the fallback page 2) the fallback pageblock itself 3) buddy pages created by splitting the fallback page (when Y > X) These decisions take the order Y into account, as well as the desired migratetype, with the goal of preventing multiple fallback allocations that could e.g. distribute UNMOVABLE allocations among multiple pageblocks. Originally, decision for 1) has implied the decision for 3). Commit 47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") changed that (probably unintentionally) so that the buddy pages in case 3) are always changed to the desired migratetype, except for CMA pageblocks. Commit fef903efcf0c ("mm/page_allo.c: restructure free-page stealing code and fix a bug") did some refactoring and added a comment that the case of 3) is intended. Commit 0cbef29a7821 ("mm: __rmqueue_fallback() should respect pageblock type") removed the comment and tried to restore the original behavior where 1) implies 3), but due to the previous refactoring, the result is instead that only 2) implies 3) - and the conditions for 2) are less frequently met than conditions for 1). This may increase fragmentation in situations where the code decides to steal all free pages from the pageblock (case 1)), but then gives back the buddy pages produced by splitting. This patch restores the original intended logic where 1) implies 3). During testing with stress-highalloc from mmtests, this has shown to decrease the number of events where UNMOVABLE and RECLAIMABLE allocations steal from MOVABLE pageblocks, which can lead to permanent fragmentation. In some cases it has increased the number of events when MOVABLE allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are fixable by sync compaction and thus less harmful. Note that evaluation has shown that the behavior introduced by 47118af076f6 for buddy pages in case 3) is actually even better than the original logic, so the following patch will introduce it properly once again. For stable backports of this patch it makes thus sense to only fix versions containing 0cbef29a7821. [iamjoonsoo.kim@lge.com: tracepoint fix] Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06usb: core: buffer: smallest buffer should start at ARCH_DMA_MINALIGNSebastian Andrzej Siewior
commit 5efd2ea8c9f4f12916ffc8ba636792ce052f6911 upstream. the following error pops up during "testusb -a -t 10" | musb-hdrc musb-hdrc.1.auto: dma_pool_free buffer-128, f134e000/be842000 (bad dma) hcd_buffer_create() creates a few buffers, the smallest has 32 bytes of size. ARCH_KMALLOC_MINALIGN is set to 64 bytes. This combo results in hcd_buffer_alloc() returning memory which is 32 bytes aligned and it might by identified by buffer_offset() as another buffer. This means the buffer which is on a 32 byte boundary will not get freed, instead it tries to free another buffer with the error message. This patch fixes the issue by creating the smallest DMA buffer with the size of ARCH_KMALLOC_MINALIGN (or 32 in case ARCH_KMALLOC_MINALIGN is smaller). This might be 32, 64 or even 128 bytes. The next three pools will have the size 128, 512 and 2048. In case the smallest pool is 128 bytes then we have only three pools instead of four (and zero the first entry in the array). The last pool size is always 2048 bytes which is the assumed PAGE_SIZE / 2 of 4096. I doubt it makes sense to continue using PAGE_SIZE / 2 where we would end up with 8KiB buffer in case we have 16KiB pages. Instead I think it makes sense to have a common size(s) and extend them if there is need to. There is a BUILD_BUG_ON() now in case someone has a minalign of more than 128 bytes. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-03-06fsnotify: fix handling of renames in auditJan Kara
commit 6ee8e25fc3e916193bce4ebb43d5439e1e2144ab upstream. Commit e9fd702a58c4 ("audit: convert audit watches to use fsnotify instead of inotify") broke handling of renames in audit. Audit code wants to update inode number of an inode corresponding to watched name in a directory. When something gets renamed into a directory to a watched name, inotify previously passed moved inode to audit code however new fsnotify code passes directory inode where the change happened. That confuses audit and it starts watching parent directory instead of a file in a directory. This can be observed for example by doing: cd /tmp touch foo bar auditctl -w /tmp/foo touch foo mv bar foo touch foo In audit log we see events like: type=CONFIG_CHANGE msg=audit(1423563584.155:90): auid=1000 ses=2 op="updated rules" path="/tmp/foo" key=(null) list=4 res=1 ... type=PATH msg=audit(1423563584.155:91): item=2 name="bar" inode=1046884 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE type=PATH msg=audit(1423563584.155:91): item=3 name="foo" inode=1046842 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=DELETE type=PATH msg=audit(1423563584.155:91): item=4 name="foo" inode=1046884 dev=08:0 2 mode=0100644 ouid=0 ogid=0 rdev=00:00 nametype=CREATE ... and that's it - we see event for the first touch after creating the audit rule, we see events for rename but we don't see any event for the last touch. However we start seeing events for unrelated stuff happening in /tmp. Fix the problem by passing moved inode as data in the FS_MOVED_FROM and FS_MOVED_TO events instead of the directory where the change happens. This doesn't introduce any new problems because noone besides audit_watch.c cares about the passed value: fs/notify/fanotify/fanotify.c cares only about FSNOTIFY_EVENT_PATH events. fs/notify/dnotify/dnotify.c doesn't care about passed 'data' value at all. fs/notify/inotify/inotify_fsnotify.c uses 'data' only for FSNOTIFY_EVENT_PATH. kernel/audit_tree.c doesn't care about passed 'data' at all. kernel/audit_watch.c expects moved inode as 'data'. Fixes: e9fd702a58c49db ("audit: convert audit watches to use fsnotify instead of inotify") Signed-off-by: Jan Kara <jack@suse.cz> Cc: Paul Moore <paul@paul-moore.com> Cc: Eric Paris <eparis@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-26ipv4: tcp: get rid of ugly unicast_sockEric Dumazet
[ Upstream commit bdbbb8527b6f6a358dbcb70dac247034d665b8e4 ] In commit be9f4a44e7d41 ("ipv4: tcp: remove per net tcp_sock") I tried to address contention on a socket lock, but the solution I chose was horrible : commit 3a7c384ffd57e ("ipv4: tcp: unicast_sock should not land outside of TCP stack") addressed a selinux regression. commit 0980e56e506b ("ipv4: tcp: set unicast_sock uc_ttl to -1") took care of another regression. commit b5ec8eeac46 ("ipv4: fix ip_send_skb()") fixed another regression. commit 811230cd85 ("tcp: ipv4: initialize unicast_sock sk_pacing_rate") was another shot in the dark. Really, just use a proper socket per cpu, and remove the skb_orphan() call, to re-enable flow control. This solves a serious problem with FQ packet scheduler when used in hostile environments, as we do not want to allocate a flow structure for every RST packet sent in response to a spoofed packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-26ipv4: try to cache dst_entries which would cause a redirectHannes Frederic Sowa
[ Upstream commit df4d92549f23e1c037e83323aff58a21b3de7fe0 ] Not caching dst_entries which cause redirects could be exploited by hosts on the same subnet, causing a severe DoS attack. This effect aggravated since commit f88649721268999 ("ipv4: fix dst race in sk_dst_get()"). Lookups causing redirects will be allocated with DST_NOCACHE set which will force dst_release to free them via RCU. Unfortunately waiting for RCU grace period just takes too long, we can end up with >1M dst_entries waiting to be released and the system will run OOM. rcuos threads cannot catch up under high softirq load. Attaching the flag to emit a redirect later on to the specific skb allows us to cache those dst_entries thus reducing the pressure on allocation and deallocation. This issue was discovered by Marcelo Leitner. Cc: Julian Anastasov <ja@ssi.bg> Signed-off-by: Marcelo Leitner <mleitner@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-02-11ALSA: ak411x: Fix stall in work callbackTakashi Iwai
commit 4161b4505f1690358ac0a9ee59845a7887336b21 upstream. When ak4114 work calls its callback and the callback invokes ak4114_reinit(), it stalls due to flush_delayed_work(). For avoiding this, control the reentrance by introducing a refcount. Also flush_delayed_work() is replaced with cancel_delayed_work_sync(). The exactly same bug is present in ak4113.c and fixed as well. Reported-by: Pavel Hofman <pavel.hofman@ivitera.com> Acked-by: Jaroslav Kysela <perex@perex.cz> Tested-by: Pavel Hofman <pavel.hofman@ivitera.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: get rid of radix tree gfp mask for pagecache_get_pageMichal Hocko
commit 45f87de57f8fad59302fd263dd81ffa4843b5b24 upstream. Commit 2457aec63745 ("mm: non-atomically mark page accessed during page cache allocation where possible") has added a separate parameter for specifying gfp mask for radix tree allocations. Not only this is less than optimal from the API point of view because it is error prone, it is also buggy currently because grab_cache_page_write_begin is using GFP_KERNEL for radix tree and if fgp_flags doesn't contain FGP_NOFS (mostly controlled by fs by AOP_FLAG_NOFS flag) but the mapping_gfp_mask has __GFP_FS cleared then the radix tree allocation wouldn't obey the restriction and might recurse into filesystem and cause deadlocks. This is the case for most filesystems unfortunately because only ext4 and gfs2 are using AOP_FLAG_NOFS. Let's simply remove radix_gfp_mask parameter because the allocation context is same for both page cache and for the radix tree. Just make sure that the radix tree gets only the sane subset of the mask (e.g. do not pass __GFP_WRITE). Long term it is more preferable to convert remaining users of AOP_FLAG_NOFS to use mapping_gfp_mask instead and simplify this interface even further. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Michal Hocko <mhocko@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: reduce cost of the fair zone allocation policyMel Gorman
commit 4ffeaf3560a52b4a69cc7909873d08c0ef5909d4 upstream. The fair zone allocation policy round-robins allocations between zones within a node to avoid age inversion problems during reclaim. If the first allocation fails, the batch counts are reset and a second attempt made before entering the slow path. One assumption made with this scheme is that batches expire at roughly the same time and the resets each time are justified. This assumption does not hold when zones reach their low watermark as the batches will be consumed at uneven rates. Allocation failure due to watermark depletion result in additional zonelist scans for the reset and another watermark check before hitting the slowpath. On UMA, the benefit is negligible -- around 0.25%. On 4-socket NUMA machine it's variable due to the variability of measuring overhead with the vmstat changes. The system CPU overhead comparison looks like 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanilla vmstat-v5 lowercost-v5 User 746.94 774.56 802.00 System 65336.22 32847.27 40852.33 Elapsed 27553.52 27415.04 27368.46 However it is worth noting that the overall benchmark still completed faster and intuitively it makes sense to take as few passes as possible through the zonelists. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: move zone->pages_scanned into a vmstat counterMel Gorman
commit 0d5d823ab4e608ec7b52ac4410de4cb74bbe0edd upstream. zone->pages_scanned is a write-intensive cache line during page reclaim and it's also updated during page free. Move the counter into vmstat to take advantage of the per-cpu updates and do not update it in the free paths unless necessary. On a small UMA machine running tiobench the difference is marginal. On a 4-node machine the overhead is more noticable. Note that automatic NUMA balancing was disabled for this test as otherwise the system CPU overhead is unpredictable. 3.16.0-rc3 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5 vmstat-v5 User 746.94 759.78 774.56 System 65336.22 58350.98 32847.27 Elapsed 27553.52 27282.02 27415.04 Note that the overhead reduction will vary depending on where exactly pages are allocated and freed. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: rearrange zone fields into read-only, page alloc, statistics and page ↵Mel Gorman
reclaim lines commit 3484b2de9499df23c4604a513b36f96326ae81ad upstream. The arrangement of struct zone has changed over time and now it has reached the point where there is some inappropriate sharing going on. On x86-64 for example o The zone->node field is shared with the zone lock and zone->node is accessed frequently from the page allocator due to the fair zone allocation policy. o span_seqlock is almost never used by shares a line with free_area o Some zone statistics share a cache line with the LRU lock so reclaim-intensive and allocator-intensive workloads can bounce the cache line on a stat update This patch rearranges struct zone to put read-only and read-mostly fields together and then splits the page allocator intensive fields, the zone statistics and the page reclaim intensive fields into their own cache lines. Note that the type of lowmem_reserve changes due to the watermark calculations being signed and avoiding a signed/unsigned conversion there. On the test configuration I used the overall size of struct zone shrunk by one cache line. On smaller machines, this is not likely to be noticable. However, on a 4-node NUMA machine running tiobench the system CPU overhead is reduced by this patch. 3.16.0-rc3 3.16.0-rc3 vanillarearrange-v5r9 User 746.94 759.78 System 65336.22 58350.98 Elapsed 27553.52 27282.02 Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: pagemap: avoid unnecessary overhead when tracepoints are deactivatedMel Gorman
commit 24b7e5819ad5cbef2b7c7376510862aa8319d240 upstream. This was formerly the series "Improve sequential read throughput" which noted some major differences in performance of tiobench since 3.0. While there are a number of factors, two that dominated were the introduction of the fair zone allocation policy and changes to CFQ. The behaviour of fair zone allocation policy makes more sense than tiobench as a benchmark and CFQ defaults were not changed due to insufficient benchmarking. This series is what's left. It's one functional fix to the fair zone allocation policy when used on NUMA machines and a reduction of overhead in general. tiobench was used for the comparison despite its flaws as an IO benchmark as in this case we are primarily interested in the overhead of page allocator and page reclaim activity. On UMA, it makes little difference to overhead 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 383.61 386.77 System 403.83 401.74 Elapsed 5411.50 5413.11 On a 4-socket NUMA machine it's a bit more noticable 3.16.0-rc3 3.16.0-rc3 vanilla lowercost-v5 User 746.94 802.00 System 65336.22 40852.33 Elapsed 27553.52 27368.46 This patch (of 6): The LRU insertion and activate tracepoints take PFN as a parameter forcing the overhead to the caller. Move the overhead to the tracepoint fast-assign method to ensure the cost is only incurred when the tracepoint is active. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: make copy_pte_range static againJerome Marchand
commit 21bda264f4243f61dfcc485174055f12ad0530b4 upstream. Commit 71e3aac0724f ("thp: transparent hugepage core") adds copy_pte_range prototype to huge_mm.h. I'm not sure why (or if) this function have been used outside of memory.c, but it currently isn't. This patch makes copy_pte_range() static again. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: non-atomically mark page accessed during page cache allocation where ↵Mel Gorman
possible commit 2457aec63745e235bcafb7ef312b182d8682f0fc upstream. aops->write_begin may allocate a new page and make it visible only to have mark_page_accessed called almost immediately after. Once the page is visible the atomic operations are necessary which is noticable overhead when writing to an in-memory filesystem like tmpfs but should also be noticable with fast storage. The objective of the patch is to initialse the accessed information with non-atomic operations before the page is visible. The bulk of filesystems directly or indirectly use grab_cache_page_write_begin or find_or_create_page for the initial allocation of a page cache page. This patch adds an init_page_accessed() helper which behaves like the first call to mark_page_accessed() but may called before the page is visible and can be done non-atomically. The primary APIs of concern in this care are the following and are used by most filesystems. find_get_page find_lock_page find_or_create_page grab_cache_page_nowait grab_cache_page_write_begin All of them are very similar in detail to the patch creates a core helper pagecache_get_page() which takes a flags parameter that affects its behavior such as whether the page should be marked accessed or not. Then old API is preserved but is basically a thin wrapper around this core function. Each of the filesystems are then updated to avoid calling mark_page_accessed when it is known that the VM interfaces have already done the job. There is a slight snag in that the timing of the mark_page_accessed() has now changed so in rare cases it's possible a page gets to the end of the LRU as PageReferenced where as previously it might have been repromoted. This is expected to be rare but it's worth the filesystem people thinking about it in case they see a problem with the timing change. It is also the case that some filesystems may be marking pages accessed that previously did not but it makes sense that filesystems have consistent behaviour in this regard. The test case used to evaulate this is a simple dd of a large file done multiple times with the file deleted on each iterations. The size of the file is 1/10th physical memory to avoid dirty page balancing. In the async case it will be possible that the workload completes without even hitting the disk and will have variable results but highlight the impact of mark_page_accessed for async IO. The sync results are expected to be more stable. The exception is tmpfs where the normal case is for the "IO" to not hit the disk. The test machine was single socket and UMA to avoid any scheduling or NUMA artifacts. Throughput and wall times are presented for sync IO, only wall times are shown for async as the granularity reported by dd and the variability is unsuitable for comparison. As async results were variable do to writback timings, I'm only reporting the maximum figures. The sync results were stable enough to make the mean and stddev uninteresting. The performance results are reported based on a run with no profiling. Profile data is based on a separate run with oprofile running. async dd 3.15.0-rc3 3.15.0-rc3 vanilla accessed-v2 ext3 Max elapsed 13.9900 ( 0.00%) 11.5900 ( 17.16%) tmpfs Max elapsed 0.5100 ( 0.00%) 0.4900 ( 3.92%) btrfs Max elapsed 12.8100 ( 0.00%) 12.7800 ( 0.23%) ext4 Max elapsed 18.6000 ( 0.00%) 13.3400 ( 28.28%) xfs Max elapsed 12.5600 ( 0.00%) 2.0900 ( 83.36%) The XFS figure is a bit strange as it managed to avoid a worst case by sheer luck but the average figures looked reasonable. samples percentage ext3 86107 0.9783 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext3 23833 0.2710 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext3 5036 0.0573 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed ext4 64566 0.8961 vmlinux-3.15.0-rc4-vanilla mark_page_accessed ext4 5322 0.0713 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed ext4 2869 0.0384 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 62126 1.7675 vmlinux-3.15.0-rc4-vanilla mark_page_accessed xfs 1904 0.0554 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed xfs 103 0.0030 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed btrfs 10655 0.1338 vmlinux-3.15.0-rc4-vanilla mark_page_accessed btrfs 2020 0.0273 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed btrfs 587 0.0079 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed tmpfs 59562 3.2628 vmlinux-3.15.0-rc4-vanilla mark_page_accessed tmpfs 1210 0.0696 vmlinux-3.15.0-rc4-accessed-v3r25 init_page_accessed tmpfs 94 0.0054 vmlinux-3.15.0-rc4-accessed-v3r25 mark_page_accessed [akpm@linux-foundation.org: don't run init_page_accessed() against an uninitialised pointer] Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Tested-by: Prabhakar Lad <prabhakar.csengg@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: shmem: avoid atomic operation during shmem_getpage_gfpMel Gorman
commit 07a427884348d38a6fd56fa4d78249c407196650 upstream. shmem_getpage_gfp uses an atomic operation to set the SwapBacked field before it's even added to the LRU or visible. This is unnecessary as what could it possible race against? Use an unlocked variant. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Rik van Riel <riel@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: convert hot/cold parameter and immediate callers to boolMel Gorman
commit b745bc85f21ea707e4ea1a91948055fa3e72c77b upstream. cold is a bool, make it one. Make the likely case the "if" part of the block instead of the else as according to the optimisation manual this is preferred. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: reduce number of times page_to_pfn is calledMel Gorman
commit dc4b0caff24d9b2918e9f27bc65499ee63187eba upstream. In the free path we calculate page_to_pfn multiple times. Reduce that. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: use unsigned int for order in more placesMel Gorman
commit 7aeb09f9104b760fc53c98cb7d20d06640baf9e6 upstream. X86 prefers the use of unsigned types for iterators and there is a tendency to mix whether a signed or unsigned type if used for page order. This converts a number of sites in mm/page_alloc.c to use unsigned int for order where possible. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm: page_alloc: use jump labels to avoid checking number_of_cpusetsMel Gorman
commit 664eeddeef6539247691197c1ac124d4aa872ab6 upstream. If cpusets are not in use then we still check a global variable on every page allocation. Use jump labels to avoid the overhead. Signed-off-by: Mel Gorman <mgorman@suse.de> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29include/linux/jump_label.h: expose the reference countMel Gorman
commit ea5e9539abf1258f23e725cb9cb25aa74efa29eb upstream. This patch exposes the jump_label reference count in preparation for the next patch. cpusets cares about both the jump_label being enabled and how many users of the cpusets there currently are. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29mm/swap.c: clean up *lru_cache_add* functionsJianyu Zhan
commit 2329d3751b082b4fd354f334a88662d72abac52d upstream. In mm/swap.c, __lru_cache_add() is exported, but actually there are no users outside this file. This patch unexports __lru_cache_add(), and makes it static. It also exports lru_cache_add_file(), as it is use by cifs and fuse, which can loaded as modules. Signed-off-by: Jianyu Zhan <nasa4836@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shaohua Li <shli@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Cc: Seth Jennings <sjenning@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Rafael Aquini <aquini@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Khalid Aziz <khalid.aziz@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29crypto: prefix module autoloading with "crypto-"Kees Cook
commit 5d26a105b5a73e5635eae0629b42fa0a90e07b7b upstream. This prefixes all crypto module loading with "crypto-" so we never run the risk of exposing module auto-loading to userspace via a crypto API, as demonstrated by Mathias Krause: https://lkml.org/lkml/2013/3/4/70 Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29time: settimeofday: Validate the values of tv from userSasha Levin
commit 6ada1fc0e1c4775de0e043e1bd3ae9d065491aa5 upstream. An unvalidated user input is multiplied by a constant, which can result in an undefined behaviour for large values. While this is validated later, we should avoid triggering undefined behaviour. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> [jstultz: include trivial milisecond->microsecond correction noticed by Andy] Signed-off-by: John Stultz <john.stultz@linaro.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-29PCI: Add flag for devices where we can't use bus resetAlex Williamson
commit f331a859e0ee5a898c1f47596eddad4c4f02d657 upstream. Enable a mechanism for devices to quirk that they do not behave when doing a PCI bus reset. We require a modest level of spec compliant behavior in order to do a reset, for instance the device should come out of reset without throwing errors and PCI config space should be accessible after reset. This is too much to ask for some devices. Link: http://lkml.kernel.org/r/20140923210318.498dacbd@dualc.maya.org Signed-off-by: Alex Williamson <alex.williamson@redhat.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27genirq: Prevent proc race against freeing of irq descriptorsThomas Gleixner
commit c291ee622165cb2c8d4e7af63fffd499354a23be upstream. Since the rework of the sparse interrupt code to actually free the unused interrupt descriptors there exists a race between the /proc interfaces to the irq subsystem and the code which frees the interrupt descriptor. CPU0 CPU1 show_interrupts() desc = irq_to_desc(X); free_desc(desc) remove_from_radix_tree(); kfree(desc); raw_spinlock_irq(&desc->lock); /proc/interrupts is the only interface which can actively corrupt kernel memory via the lock access. /proc/stat can only read from freed memory. Extremly hard to trigger, but possible. The interfaces in /proc/irq/N/ are not affected by this because the removal of the proc file is serialized in procfs against concurrent readers/writers. The removal happens before the descriptor is freed. For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue as the descriptor is never freed. It's merely cleared out with the irq descriptor lock held. So any concurrent proc access will either see the old correct value or the cleared out ones. Protect the lookup and access to the irq descriptor in show_interrupts() with the sparse_irq_lock. Provide kstat_irqs_usr() which is protecting the lookup and access with sparse_irq_lock and switch /proc/stat to use it. Document the existing kstat_irqs interfaces so it's clear that the caller needs to take care about protection. The users of these interfaces are either not affected due to SPARSE_IRQ=n or already protected against removal. Fixes: 1f5a5b87f78f "genirq: Implement a sane sparse_irq allocator" Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-27in6: fix conflict with glibcstephen hemminger
[ Upstream commit 6d08acd2d32e3e877579315dc3202d7a5f336d98 ] Resolve conflicts between glibc definition of IPV6 socket options and those defined in Linux headers. Looks like earlier efforts to solve this did not cover all the definitions. It resolves warnings during iproute2 build. Please consider for stable as well. Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16mm: propagate error from stack expansion even for guard pageLinus Torvalds
commit fee7e49d45149fba60156f5b59014f764d3e3728 upstream. Jay Foad reports that the address sanitizer test (asan) sometimes gets confused by a stack pointer that ends up being outside the stack vma that is reported by /proc/maps. This happens due to an interaction between RLIMIT_STACK and the guard page: when we do the guard page check, we ignore the potential error from the stack expansion, which effectively results in a missing guard page, since the expected stack expansion won't have been done. And since /proc/maps explicitly ignores the guard page (commit d7824370e263: "mm: fix up some user-visible effects of the stack guard page"), the stack pointer ends up being outside the reported stack area. This is the minimal patch: it just propagates the error. It also effectively makes the guard page part of the stack limit, which in turn measn that the actual real stack is one page less than the stack limit. Let's see if anybody notices. We could teach acct_stack_growth() to allow an extra page for a grow-up/grow-down stack in the rlimit test, but I don't want to add more complexity if it isn't needed. Reported-and-tested-by: Jay Foad <jay.foad@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16tracing/sched: Check preempt_count() for current when reading task->stateSteven Rostedt (Red Hat)
commit aee4e5f3d3abb7a2239dd02f6d8fb173413fd02f upstream. When recording the state of a task for the sched_switch tracepoint a check of task_preempt_count() is performed to see if PREEMPT_ACTIVE is set. This is because, technically, a task being preempted is really in the TASK_RUNNING state, and that is what should be recorded when tracing a sched_switch, even if the task put itself into another state (it hasn't scheduled out in that state yet). But with the change to use per_cpu preempt counts, the task_thread_info(p)->preempt_count is no longer used, and instead task_preempt_count(p) is used. The problem is that this does not use the current preempt count but a stale one from a previous sched_switch. The task_preempt_count(p) uses saved_preempt_count and not preempt_count(). But for tracing sched_switch, if p is current, we really want preempt_count(). I hit this bug when I was tracing sleep and the call from do_nanosleep() scheduled out in the "RUNNING" state. sleep-4290 [000] 537272.259992: sched_switch: sleep:4290 [120] R ==> swapper/0:0 [120] sleep-4290 [000] 537272.260015: kernel_stack: <stack trace> => __schedule (ffffffff8150864a) => schedule (ffffffff815089f8) => do_nanosleep (ffffffff8150b76c) => hrtimer_nanosleep (ffffffff8108d66b) => SyS_nanosleep (ffffffff8108d750) => return_to_handler (ffffffff8150e8e5) => tracesys_phase2 (ffffffff8150c844) After a bit of hair pulling, I found that the state was really TASK_INTERRUPTIBLE, but the saved_preempt_count had an old PREEMPT_ACTIVE set and caused the sched_switch tracepoint to show it as RUNNING. Link: http://lkml.kernel.org/r/20141210174428.3cb7542a@gandalf.local.home Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Fixes: 01028747559a "sched: Create more preempt_count accessors" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-16pstore-ram: Allow optional mapping with pgprot_noncachedTony Lindgren
commit 027bc8b08242c59e19356b4b2c189f2d849ab660 upstream. On some ARMs the memory can be mapped pgprot_noncached() and still be working for atomic operations. As pointed out by Colin Cross <ccross@android.com>, in some cases you do want to use pgprot_noncached() if the SoC supports it to see a debug printk just before a write hanging the system. On ARMs, the atomic operations on strongly ordered memory are implementation defined. So let's provide an optional kernel parameter for configuring pgprot_noncached(), and use pgprot_writecombine() by default. Cc: Arnd Bergmann <arnd@arndb.de> Cc: Rob Herring <robherring2@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Anton Vorontsov <anton@enomsg.org> Cc: Colin Cross <ccross@android.com> Cc: Olof Johansson <olof@lixom.net> Cc: Russell King <linux@arm.linux.org.uk> Acked-by: Kees Cook <keescook@chromium.org> Signed-off-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08audit: restore AUDIT_LOGINUID unset ABIRichard Guy Briggs
commit 041d7b98ffe59c59fdd639931dea7d74f9aa9a59 upstream. A regression was caused by commit 780a7654cee8: audit: Make testing for a valid loginuid explicit. (which in turn attempted to fix a regression caused by e1760bd) When audit_krule_to_data() fills in the rules to get a listing, there was a missing clause to convert back from AUDIT_LOGINUID_SET to AUDIT_LOGINUID. This broke userspace by not returning the same information that was sent and expected. The rule: auditctl -a exit,never -F auid=-1 gives: auditctl -l LIST_RULES: exit,never f24=0 syscall=all when it should give: LIST_RULES: exit,never auid=-1 (0xffffffff) syscall=all Tag it so that it is reported the same way it was set. Create a new private flags audit_krule field (pflags) to store it that won't interact with the public one from the API. Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Paul Moore <pmoore@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08userns: Add a knob to disable setgroups on a per user namespace basisEric W. Biederman
commit 9cc46516ddf497ea16e8d7cb986ae03a0f6b92f8 upstream. - Expose the knob to user space through a proc file /proc/<pid>/setgroups A value of "deny" means the setgroups system call is disabled in the current processes user namespace and can not be enabled in the future in this user namespace. A value of "allow" means the segtoups system call is enabled. - Descendant user namespaces inherit the value of setgroups from their parents. - A proc file is used (instead of a sysctl) as sysctls currently do not allow checking the permissions at open time. - Writing to the proc file is restricted to before the gid_map for the user namespace is set. This ensures that disabling setgroups at a user namespace level will never remove the ability to call setgroups from a process that already has that ability. A process may opt in to the setgroups disable for itself by creating, entering and configuring a user namespace or by calling setns on an existing user namespace with setgroups disabled. Processes without privileges already can not call setgroups so this is a noop. Prodcess with privilege become processes without privilege when entering a user namespace and as with any other path to dropping privilege they would not have the ability to call setgroups. So this remains within the bounds of what is possible without a knob to disable setgroups permanently in a user namespace. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08userns: Don't allow setgroups until a gid mapping has been setablishedEric W. Biederman
commit 273d2c67c3e179adb1e74f403d1e9a06e3f841b5 upstream. setgroups is unique in not needing a valid mapping before it can be called, in the case of setgroups(0, NULL) which drops all supplemental groups. The design of the user namespace assumes that CAP_SETGID can not actually be used until a gid mapping is established. Therefore add a helper function to see if the user namespace gid mapping has been established and call that function in the setgroups permission check. This is part of the fix for CVE-2014-8989, being able to drop groups without privilege using user namespaces. Reviewed-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2015-01-08groups: Consolidate the setgroups permission checksEric W. Biederman
commit 7ff4d90b4c24a03666f296c3d4878cd39001e81e upstream. Today there are 3 instances of setgroups and due to an oversight their permission checking has diverged. Add a common function so that they may all share the same permission checking code. This corrects the current oversight in the current permission checks and adds a helper to avoid this in the future. A user namespace security fix will update this new helper, shortly. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06bitops: Fix shift overflow in GENMASK macrosMaxime COQUELIN
commit 00b4d9a14125f1e51874def2b9de6092e007412d upstream. On some 32 bits architectures, including x86, GENMASK(31, 0) returns 0 instead of the expected ~0UL. This is the same on some 64 bits architectures with GENMASK_ULL(63, 0). This is due to an overflow in the shift operand, 1 << 32 for GENMASK, 1 << 64 for GENMASK_ULL. Reported-by: Eric Paire <eric.paire@st.com> Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Signed-off-by: Maxime Coquelin <maxime.coquelin@st.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: linux@rasmusvillemoes.dk Cc: gong.chen@linux.intel.com Cc: John Sullivan <jsrhbz@kanargh.force9.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Theodore Ts'o <tytso@mit.edu> Fixes: 10ef6b0dffe4 ("bitops: Introduce a more generic BITMASK macro") Link: http://lkml.kernel.org/r/1415267659-10563-1-git-send-email-maxime.coquelin@st.com Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06iio: Fix IIO_EVENT_CODE_EXTRACT_DIR bit maskCristina Ciocan
commit ccf54555da9a5e91e454b909ca6a5303c7d6b910 upstream. The direction field is set on 7 bits, thus we need to AND it with 0111 111 mask in order to retrieve it, that is 0x7F, not 0xCF as it is now. Fixes: ade7ef7ba (staging:iio: Differential channel handling) Signed-off-by: Cristina Ciocan <cristina.ciocan@intel.com> Signed-off-by: Jonathan Cameron <jic23@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06ASoC: dpcm: Fix race between FE/BE updates and triggerTakashi Iwai
commit ea9d0d771fcd32cd56070819749477d511ec9117 upstream. DPCM can update the FE/BE connection states totally asynchronously from the FE's PCM state. Most of FE/BE state changes are protected by mutex, so that they won't race, but there are still some actions that are uncovered. For example, suppose to switch a BE while a FE's stream is running. This would call soc_dpcm_runtime_update(), which sets FE's runtime_update flag, then sets up and starts BEs, and clears FE's runtime_update flag again. When a device emits XRUN during this operation, the PCM core triggers snd_pcm_stop(XRUN). Since the trigger action is an atomic ops, this isn't blocked by the mutex, thus it kicks off DPCM's trigger action. It eventually updates and clears FE's runtime_update flag while soc_dpcm_runtime_update() is running concurrently, and it results in confusion. Usually, for avoiding such a race, we take a lock. There is a PCM stream lock for that purpose. However, as already mentioned, the trigger action is atomic, and we can't take the lock for the whole soc_dpcm_runtime_update() or other operations that include the lengthy jobs like hw_params or prepare. This patch provides an alternative solution. This adds a way to defer the conflicting trigger callback to be executed at the end of FE/BE state changes. For doing it, two things are introduced: - Each runtime_update state change of FEs is protected via PCM stream lock. - The FE's trigger callback checks the runtime_update flag. If it's not set, the trigger action is executed there. If set, mark the pending trigger action and returns immediately. - At the exit of runtime_update state change, it checks whether the pending trigger is present. If yes, it executes the trigger action at this point. Reported-and-tested-by: Qiao Zhou <zhouqiao@marvell.com> Signed-off-by: Takashi Iwai <tiwai@suse.de> Acked-by: Liam Girdwood <liam.r.girdwood@linux.intel.com> Signed-off-by: Mark Brown <broonie@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06PCI/MSI: Add device flag indicating that 64-bit MSIs don't workBenjamin Herrenschmidt
commit f144d1496b47e7450f41b767d0d91c724c2198bc upstream. This can be set by quirks/drivers to be used by the architecture code that assigns the MSI addresses. We additionally add verification in the core MSI code that the values assigned by the architecture do satisfy the limitation in order to fail gracefully if they don't (ie. the arch hasn't been updated to deal with that quirk yet). Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Bjorn Helgaas <bhelgaas@google.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-12-06inetdevice: fixed signed integer overflowVincent BENAYOUN
[ Upstream commit 84bc88688e3f6ef843aa8803dbcd90168bb89faf ] There could be a signed overflow in the following code. The expression, (32-logmask) is comprised between 0 and 31 included. It may be equal to 31. In such a case the left shift will produce a signed integer overflow. According to the C99 Standard, this is an undefined behavior. A simple fix is to replace the signed int 1 with the unsigned int 1U. Signed-off-by: Vincent BENAYOUN <vincent.benayoun@trust-in-soft.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm/compaction: do not count migratepages when unnecessaryVlastimil Babka
commit f8c9301fa5a2a8b873c67f2a3d8230d5c13f61b7 upstream. During compaction, update_nr_listpages() has been used to count remaining non-migrated and free pages after a call to migrage_pages(). The freepages counting has become unneccessary, and it turns out that migratepages counting is also unnecessary in most cases. The only situation when it's needed to count cc->migratepages is when migrate_pages() returns with a negative error code. Otherwise, the non-negative return value is the number of pages that were not migrated, which is exactly the count of remaining pages in the cc->migratepages list. Furthermore, any non-zero count is only interesting for the tracepoint of mm_compaction_migratepages events, because after that all remaining unmigrated pages are put back and their count is set to 0. This patch therefore removes update_nr_listpages() completely, and changes the tracepoint definition so that the manual counting is done only when the tracepoint is enabled, and only when migrate_pages() returns a negative error code. Furthermore, migrate_pages() and the tracepoints won't be called when there's nothing to migrate. This potentially avoids some wasted cycles and reduces the volume of uninteresting mm_compaction_migratepages events where "nr_migrated=0 nr_failed=0". In the stress-highalloc mmtest, this was about 75% of the events. The mm_compaction_isolate_migratepages event is better for determining that nothing was isolated for migration, and this one was just duplicating the info. Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Acked-by: Michal Nazarewicz <mina86@mina86.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm, compaction: embed migration mode in compact_controlDavid Rientjes
commit e0b9daeb453e602a95ea43853dc12d385558ce1f upstream. We're going to want to manipulate the migration mode for compaction in the page allocator, and currently compact_control's sync field is only a bool. Currently, we only do MIGRATE_ASYNC or MIGRATE_SYNC_LIGHT compaction depending on the value of this bool. Convert the bool to enum migrate_mode and pass the migration mode in directly. Later, we'll want to avoid MIGRATE_SYNC_LIGHT for thp allocations in the pagefault patch to avoid unnecessary latency. This also alters compaction triggered from sysfs, either for the entire system or for a node, to force MIGRATE_SYNC. [akpm@linux-foundation.org: fix build] [iamjoonsoo.kim@lge.com: use MIGRATE_SYNC in alloc_contig_range()] Signed-off-by: David Rientjes <rientjes@google.com> Suggested-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm, compaction: add per-zone migration pfn cache for async compactionDavid Rientjes
commit 35979ef3393110ff3c12c6b94552208d3bdf1a36 upstream. Each zone has a cached migration scanner pfn for memory compaction so that subsequent calls to memory compaction can start where the previous call left off. Currently, the compaction migration scanner only updates the per-zone cached pfn when pageblocks were not skipped for async compaction. This creates a dependency on calling sync compaction to avoid having subsequent calls to async compaction from scanning an enormous amount of non-MOVABLE pageblocks each time it is called. On large machines, this could be potentially very expensive. This patch adds a per-zone cached migration scanner pfn only for async compaction. It is updated everytime a pageblock has been scanned in its entirety and when no pages from it were successfully isolated. The cached migration scanner pfn for sync compaction is updated only when called for sync compaction. Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Greg Thelen <gthelen@google.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm, migration: add destination page freeing callbackDavid Rientjes
commit 68711a746345c44ae00c64d8dbac6a9ce13ac54a upstream. Memory migration uses a callback defined by the caller to determine how to allocate destination pages. When migration fails for a source page, however, it frees the destination page back to the system. This patch adds a memory migration callback defined by the caller to determine how to free destination pages. If a caller, such as memory compaction, builds its own freelist for migration targets, this can reuse already freed memory instead of scanning additional memory. If the caller provides a function to handle freeing of destination pages, it is called when page migration fails. If the caller passes NULL then freeing back to the system will be handled as usual. This patch introduces no functional change. Signed-off-by: David Rientjes <rientjes@google.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Mel Gorman <mgorman@suse.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm/readahead.c: inline ra_submitFabian Frederick
commit 29f175d125f0f3a9503af8a5596f93d714cceb08 upstream. Commit f9acc8c7b35a ("readahead: sanify file_ra_state names") left ra_submit with a single function call. Move ra_submit to internal.h and inline it to save some stack. Thanks to Andrew Morton for commenting different versions. Signed-off-by: Fabian Frederick <fabf@skynet.be> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm: remove read_cache_page_async()Sasha Levin
commit 67f9fd91f93c582b7de2ab9325b6e179db77e4d5 upstream. This patch removes read_cache_page_async() which wasn't really needed anywhere and simplifies the code around it a bit. read_cache_page_async() is useful when we want to read a page into the cache without waiting for it to complete. This happens when the appropriate callback 'filler' doesn't complete its read operation and releases the page lock immediately, and instead queues a different completion routine to do that. This never actually happened anywhere in the code. read_cache_page_async() had 3 different callers: - read_cache_page() which is the sync version, it would just wait for the requested read to complete using wait_on_page_read(). - JFFS2 would call it from jffs2_gc_fetch_page(), but the filler function it supplied doesn't do any async reads, and would complete before the filler function returns - making it actually a sync read. - CRAMFS would call it using the read_mapping_page_async() wrapper, with a similar story to JFFS2 - the filler function doesn't do anything that reminds async reads and would always complete before the filler function returns. To sum it up, the code in mm/filemap.c never took advantage of having read_cache_page_async(). While there are filler callbacks that do async reads (such as the block one), we always called it with the read_cache_page(). This patch adds a mandatory wait for read to complete when adding a new page to the cache, and removes read_cache_page_async() and its wrappers. Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm + fs: prepare for non-page entries in page cache radix treesJohannes Weiner
commit 0cd6144aadd2afd19d1aca880153530c52957604 upstream. shmem mappings already contain exceptional entries where swap slot information is remembered. To be able to store eviction information for regular page cache, prepare every site dealing with the radix trees directly to handle entries other than pages. The common lookup functions will filter out non-page entries and return NULL for page cache holes, just as before. But provide a raw version of the API which returns non-page entries as well, and switch shmem over to use it. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21mm: filemap: move radix tree hole searching hereJohannes Weiner
commit e7b563bb2a6f4d974208da46200784b9c5b5a47e upstream. The radix tree hole searching code is only used for page cache, for example the readahead code trying to get a a picture of the area surrounding a fault. It sufficed to rely on the radix tree definition of holes, which is "empty tree slot". But this is about to change, though, as shadow page descriptors will be stored in the page cache after the actual pages get evicted from memory. Move the functions over to mm/filemap.c and make them native page cache operations, where they can later be adapted to handle the new definition of "page cache hole". Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel <riel@redhat.com> Reviewed-by: Minchan Kim <minchan@kernel.org> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21lib: radix-tree: add radix_tree_delete_item()Johannes Weiner
commit 53c59f262d747ea82e7414774c59a489501186a0 upstream. Provide a function that does not just delete an entry at a given index, but also allows passing in an expected item. Delete only if that item is still located at the specified index. This is handy when lockless tree traversals want to delete entries as well because they don't have to do an second, locked lookup to verify the slot has not changed under them before deleting the entry. Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Minchan Kim <minchan@kernel.org> Reviewed-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Luigi Semenzato <semenzato@google.com> Cc: Metin Doslu <metin@citusdata.com> Cc: Michel Lespinasse <walken@google.com> Cc: Ozgun Erdogan <ozgun@citusdata.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <klamm@yandex-team.ru> Cc: Ryan Mallon <rmallon@gmail.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21net: sctp: fix skb_over_panic when receiving malformed ASCONF chunksDaniel Borkmann
commit 9de7922bc709eee2f609cd01d98aaedc4cf5ea74 upstream. Commit 6f4c618ddb0 ("SCTP : Add paramters validity check for ASCONF chunk") added basic verification of ASCONF chunks, however, it is still possible to remotely crash a server by sending a special crafted ASCONF chunk, even up to pre 2.6.12 kernels: skb_over_panic: text:ffffffffa01ea1c3 len:31056 put:30768 head:ffff88011bd81800 data:ffff88011bd81800 tail:0x7950 end:0x440 dev:<NULL> ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:129! [...] Call Trace: <IRQ> [<ffffffff8144fb1c>] skb_put+0x5c/0x70 [<ffffffffa01ea1c3>] sctp_addto_chunk+0x63/0xd0 [sctp] [<ffffffffa01eadaf>] sctp_process_asconf+0x1af/0x540 [sctp] [<ffffffff8152d025>] ? _read_unlock_bh+0x15/0x20 [<ffffffffa01e0038>] sctp_sf_do_asconf+0x168/0x240 [sctp] [<ffffffffa01e3751>] sctp_do_sm+0x71/0x1210 [sctp] [<ffffffff8147645d>] ? fib_rules_lookup+0xad/0xf0 [<ffffffffa01e6b22>] ? sctp_cmp_addr_exact+0x32/0x40 [sctp] [<ffffffffa01e8393>] sctp_assoc_bh_rcv+0xd3/0x180 [sctp] [<ffffffffa01ee986>] sctp_inq_push+0x56/0x80 [sctp] [<ffffffffa01fcc42>] sctp_rcv+0x982/0xa10 [sctp] [<ffffffffa01d5123>] ? ipt_local_in_hook+0x23/0x28 [iptable_filter] [<ffffffff8148bdc9>] ? nf_iterate+0x69/0xb0 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0 [<ffffffff8148bf86>] ? nf_hook_slow+0x76/0x120 [<ffffffff81496d10>] ? ip_local_deliver_finish+0x0/0x2d0 [<ffffffff81496ded>] ip_local_deliver_finish+0xdd/0x2d0 [<ffffffff81497078>] ip_local_deliver+0x98/0xa0 [<ffffffff8149653d>] ip_rcv_finish+0x12d/0x440 [<ffffffff81496ac5>] ip_rcv+0x275/0x350 [<ffffffff8145c88b>] __netif_receive_skb+0x4ab/0x750 [<ffffffff81460588>] netif_receive_skb+0x58/0x60 This can be triggered e.g., through a simple scripted nmap connection scan injecting the chunk after the handshake, for example, ... -------------- INIT[ASCONF; ASCONF_ACK] -------------> <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------ -------------------- COOKIE-ECHO --------------------> <-------------------- COOKIE-ACK --------------------- ------------------ ASCONF; UNKNOWN ------------------> ... where ASCONF chunk of length 280 contains 2 parameters ... 1) Add IP address parameter (param length: 16) 2) Add/del IP address parameter (param length: 255) ... followed by an UNKNOWN chunk of e.g. 4 bytes. Here, the Address Parameter in the ASCONF chunk is even missing, too. This is just an example and similarly-crafted ASCONF chunks could be used just as well. The ASCONF chunk passes through sctp_verify_asconf() as all parameters passed sanity checks, and after walking, we ended up successfully at the chunk end boundary, and thus may invoke sctp_process_asconf(). Parameter walking is done with WORD_ROUND() to take padding into account. In sctp_process_asconf()'s TLV processing, we may fail in sctp_process_asconf_param() e.g., due to removal of the IP address that is also the source address of the packet containing the ASCONF chunk, and thus we need to add all TLVs after the failure to our ASCONF response to remote via helper function sctp_add_asconf_response(), which basically invokes a sctp_addto_chunk() adding the error parameters to the given skb. When walking to the next parameter this time, we proceed with ... length = ntohs(asconf_param->param_hdr.length); asconf_param = (void *)asconf_param + length; ... instead of the WORD_ROUND()'ed length, thus resulting here in an off-by-one that leads to reading the follow-up garbage parameter length of 12336, and thus throwing an skb_over_panic for the reply when trying to sctp_addto_chunk() next time, which implicitly calls the skb_put() with that length. Fix it by using sctp_walk_params() [ which is also used in INIT parameter processing ] macro in the verification *and* in ASCONF processing: it will make sure we don't spill over, that we walk parameters WORD_ROUND()'ed. Moreover, we're being more defensive and guard against unknown parameter types and missized addresses. Joint work with Vlad Yasevich. Fixes: b896b82be4ae ("[SCTP] ADDIP: Support for processing incoming ASCONF_ACK chunks.") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2014-11-21net: sctp: fix panic on duplicate ASCONF chunksDaniel Borkmann
commit b69040d8e39f20d5215a03502a8e8b4c6ab78395 upstream. When receiving a e.g. semi-good formed connection scan in the form of ... -------------- INIT[ASCONF; ASCONF_ACK] -------------> <----------- INIT-ACK[ASCONF; ASCONF_ACK] ------------ -------------------- COOKIE-ECHO --------------------> <-------------------- COOKIE-ACK --------------------- ---------------- ASCONF_a; ASCONF_b -----------------> ... where ASCONF_a equals ASCONF_b chunk (at least both serials need to be equal), we panic an SCTP server! The problem is that good-formed ASCONF chunks that we reply with ASCONF_ACK chunks are cached per serial. Thus, when we receive a same ASCONF chunk twice (e.g. through a lost ASCONF_ACK), we do not need to process them again on the server side (that was the idea, also proposed in the RFC). Instead, we know it was cached and we just resend the cached chunk instead. So far, so good. Where things get nasty is in SCTP's side effect interpreter, that is, sctp_cmd_interpreter(): While incoming ASCONF_a (chunk = event_arg) is being marked !end_of_packet and !singleton, and we have an association context, we do not flush the outqueue the first time after processing the ASCONF_ACK singleton chunk via SCTP_CMD_REPLY. Instead, we keep it queued up, although we set local_cork to 1. Commit 2e3216cd54b1 changed the precedence, so that as long as we get bundled, incoming chunks we try possible bundling on outgoing queue as well. Before this commit, we would just flush the output queue. Now, while ASCONF_a's ASCONF_ACK sits in the corked outq, we continue to process the same ASCONF_b chunk from the packet. As we have cached the previous ASCONF_ACK, we find it, grab it and do another SCTP_CMD_REPLY command on it. So, effectively, we rip the chunk->list pointers and requeue the same ASCONF_ACK chunk another time. Since we process ASCONF_b, it's correctly marked with end_of_packet and we enforce an uncork, and thus flush, thus crashing the kernel. Fix it by testing if the ASCONF_ACK is currently pending and if that is the case, do not requeue it. When flushing the output queue we may relink the chunk for preparing an outgoing packet, but eventually unlink it when it's copied into the skb right before transmission. Joint work with Vlad Yasevich. Fixes: 2e3216cd54b1 ("sctp: Follow security requirement of responding with 1 packet") Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Cc: Josh Boyer <jwboyer@fedoraproject.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>