aboutsummaryrefslogtreecommitdiff
path: root/drivers/md/md-cluster.c
AgeCommit message (Collapse)Author
2015-10-24md-cluster: remove mddev arg from add_resync_info()NeilBrown
The arg isn't used, so its presence is only confusing. Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24md-cluster: don't cast void pointers when assigning them.NeilBrown
It is common practice in the kernel to leave out this case. It isn't needed and adds little if any value. Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24md-cluster: discard unused sb_mutex.NeilBrown
Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-24md-cluster: Fix warnings when build with CF=-D__CHECK_ENDIAN__Guoqing Jiang
This patches fixes sparse warnings like incorrect type in assignment (different base types), cast to restricted __le64. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-16md-cluster: metadata_update_finish: consistently use cmsg.raid_slot as le32NeilBrown
As cmsg.raid_slot is le32, comparing for >0 is not meaningful. So introduce cpu-endian 'raid_slot' and only assign to cmsg.raid_slot when we know value is valid. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-12md-cluster: Add 'SUSE' as author for md-cluster.cGuoqing Jiang
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12md-cluster: zero cmsg before it was sentGuoqing Jiang
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12md-cluster: make sure the node do not receive it's own msgGuoqing Jiang
During the past test, the node occasionally received the msg which is sent from itself, this case should not happen in theory, but it is better to avoid it in case something wrong happened. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: remove unnecessary setting for slotGuoqing Jiang
Since slot will be set within _sendmsg, we can remove the redundant code in resync_info_update. Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12md-cluster: make other members of cluster_msg is handled by little endian funcsGuoqing Jiang
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
2015-10-12md-cluster: Do not printk() every received messageGoldwyn Rodrigues
The receive daemon prints kernel messages for every network message received. This would fill the kernel message log with unnecessary messages. Remove the pr_info() messages. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Fix adding of new disk with new reload codeGoldwyn Rodrigues
Adding the disk worked incorrectly with the new reload code. Fix it: - No operation should be performed on rdev marked as Candidate - After a metadata update operation, kick disk if role is 0xfffe else clear Candidate bit and continue with the regular change check. - Saving the mode of the lock resource to check if token lock is already locked, because it can be called twice while adding a disk. However, unlock_comm() must be called only once. - add_new_disk() is called by the node initiating the --add operation. If it needs to be canceled, call add_new_disk_cancel(). The operation is completed by md_update_sb() which will write and unlock the communication. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Perform resync/recovery under a DLM lockGoldwyn Rodrigues
Resync or recovery must be performed by only one node at a time. A DLM lock resource, resync_lockres provides the mutual exclusion so that only one node performs the recovery/resync at a time. If a node is unable to get the resync_lockres, because recovery is being performed by another node, it set MD_RECOVER_NEEDED so as to schedule recovery in the future. Remove the debug message in resync_info_update() used during development. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Improve md_reload_sb to be less error proneGoldwyn Rodrigues
md_reload_sb is too simplistic and it explicitly needs to determine the changes made by the writing node. However, there are multiple areas where a simple reload could fail. Instead, read the superblock of one of the "good" rdevs and update the necessary information: - read the superblock into a newly allocated page, by temporarily swapping out rdev->sb_page and calling ->load_super. - if that fails return - if it succeeds, call check_sb_changes 1. iterates over list of active devices and checks the matching dev_roles[] value. If that is 'faulty', the device must be marked as faulty - call md_error to mark the device as faulty. Make sure not to set CHANGE_DEVS and wakeup mddev->thread or else it would initiate a resync process, which is the responsibility of the "primary" node. - clear the Blocked bit - Call remove_and_add_spares() to hot remove the device. If the device is 'spare': - call remove_and_add_spares() to get the number of spares added in this operation. - Reduce mddev->degraded to mark the array as not degraded. 2. reset recovery_cp - read the rest of the rdevs to update recovery_offset. If recovery_offset is equal to MaxSector, call spare_active() to set it In_sync This required that recovery_offset be initialized to MaxSector, as opposed to zero so as to communicate the end of sync for a rdev. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: Wake up suspended processGoldwyn Rodrigues
When the suspended_area is deleted, the suspended processes must be woken up in order to complete their I/O. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-10-12md-cluster: send BITMAP_NEEDS_SYNC when node is leaving clusterGuoqing Jiang
Previously, BITMAP_NEEDS_SYNC message is sent when the resyc aborts, but it could abort for different reasons, and not all of reasons require another node to take over the resync ownship. It is better make BITMAP_NEEDS_SYNC message only be sent when the node is leaving cluster with dirty bitmap. And we also need to ensure dlm connection is ok. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-10-12md-cluster: Use a small window for resyncGoldwyn Rodrigues
Suspending the entire device for resync could take too long. Resync in small chunks. cluster's resync window (32M) is maintained in r1conf as cluster_sync_low and cluster_sync_high and processed in raid1's sync_request(). If the current resync is outside the cluster resync window: 1. Set the cluster_sync_low to curr_resync_completed. 2. Check if the sync will fit in the new window, if not issue a wait_barrier() and set cluster_sync_low to sector_nr. 3. Set cluster_sync_high to cluster_sync_low + resync_window. 4. Send a message to all nodes so they may add it in their suspension list. bitmap_cond_end_sync is modified to allow to force a sync inorder to get the curr_resync_completed uptodate with the sector passed. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-10-12md-cluster: complete all write requests before adding suspend_infoGoldwyn Rodrigues
process_suspend_info - which handles the RESYNCING request - must not reply until all writes which were initiated before the request arrived, have completed. As a by-product, all process_* functions now take mddev as their first arguement making it uniform. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: remove inappropriate try_module_get from join()NeilBrown
md_setup_cluster already calls try_module_get(), so this try_module_get isn't needed. Also, there is no matching module_put (except in error patch), so this leaves an unbalanced module count. Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: Read the disk bitmap sb and check if it needs recoveryGuoqing Jiang
In gather_all_resync_info, we need to read the disk bitmap sb and check if it needs recovery. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: only call complete(&cinfo->completion) when node join clusterGuoqing Jiang
Introduce MD_CLUSTER_BEGIN_JOIN_CLUSTER flag to make sure complete(&cinfo->completion) is only be invoked when node join cluster. Otherwise node failure could also call the complete, and it doesn't make sense to do it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: add missed lockres_freeGuoqing Jiang
We also need to free the lock resource before goto out. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: remove the unused sb_lockGuoqing Jiang
The sb_lock is not used anywhere, so let's remove it. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: init suspend_list and suspend_lock early in joinGuoqing Jiang
If the node just join the cluster, and receive the msg from other nodes before init suspend_list, it will cause kernel crash due to NULL pointer dereference, so move the initializations early to fix the bug. md-cluster: Joined cluster 3578507b-e0cb-6d4f-6322-696cd7b1b10c slot 3 BUG: unable to handle kernel NULL pointer dereference at (null) ... ... ... Call Trace: [<ffffffffa0444924>] process_recvd_msg+0x2e4/0x330 [md_cluster] [<ffffffffa0444a06>] recv_daemon+0x96/0x170 [md_cluster] [<ffffffffa045189d>] md_thread+0x11d/0x170 [md_mod] [<ffffffff810768c4>] kthread+0xb4/0xc0 [<ffffffff8151927c>] ret_from_fork+0x7c/0xb0 ... ... ... RIP [<ffffffffa0443581>] __remove_suspend_info+0x11/0xa0 [md_cluster] Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: add the error check if failed to get dlm lockGuoqing Jiang
In complicated cluster environment, it is possible that the dlm lock couldn't be get/convert on purpose, the related err info is added for better debug potential issue. For lockres_free, if the lock is blocking by a lock request or conversion request, then dlm_unlock just put it back to grant queue, so need to ensure the lock is free finally. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: init completion within lockres_initGuoqing Jiang
We should init completion within lockres_init, otherwise completion could be initialized more than one time during it's life cycle. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: fix deadlock issue on message lockGuoqing Jiang
There is problem with previous communication mechanism, and we got below deadlock scenario with cluster which has 3 nodes. Sender Receiver Receiver token(EX) message(EX) writes message downconverts message(CR) requests ack(EX) get message(CR) gets message(CR) reads message reads message requests EX on message requests EX on message To fix this problem, we do the following changes: 1. the sender downconverts MESSAGE to CW rather than CR. 2. and the receiver request PR lock not EX lock on message. And in case we failed to down-convert EX to CW on message, it is better to unlock message otherthan still hold the lock. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Lidong Zhong <ldzhong@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: transfer the resync ownership to another nodeGuoqing Jiang
When node A stops an array while the array is doing a resync, we need to let another node B take over the resync task. To achieve the goal, we need the A send an explicit BITMAP_NEEDS_SYNC message to the cluster. And the node B which received that message will invoke __recover_slot to do resync. Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: split recover_slot for future code reuseGuoqing Jiang
Make recover_slot as a wraper to __recover_slot, since the logic of __recover_slot can be reused for the condition when other nodes need to take over the resync job. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-08-31md-cluster: use %pU to print UUIDsGuoqing Jiang
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-07-24Fix read-balancing during node failureGoldwyn Rodrigues
During a node failure, We need to suspend read balancing so that the reads are directed to the first device and stale data is not read. Suspending writes is not required because these would be recorded and synced eventually. A new flag MD_CLUSTER_SUSPEND_READ_BALANCING is set in recover_prep(). area_resyncing() will respond true for the entire devices if this flag is set and the request type is READ. The flag is cleared in recover_done(). Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reported-By: David Teigland <teigland@redhat.com> Signed-off-by: NeilBrown <neilb@suse.com>
2015-04-22md-cluster: re-add capabilitiesGoldwyn Rodrigues
When "re-add" is writted to /sys/block/mdXX/md/dev-YYY/state, the clustered md: 1. Sends RE_ADD message with the desc_nr. Nodes receiving the message clear the Faulty bit in their respective rdev->flags. 2. The node initiating re-add, gathers the bitmaps of all nodes and copies them into the local bitmap. It does not clear the bitmap from which it is copying. 3. Initiating node schedules a md recovery to sync the devices. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md-cluster: remove capabilitiesGoldwyn Rodrigues
This adds "remove" capabilities for the clustered environment. When a user initiates removal of a device from the array, a REMOVE message with disk number in the array is sent to all the nodes which kick the respective device in their own array. This facilitates the removal of failed devices. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-04-22md-cluster: correct the num for comparisonGuoqing Jiang
Since the node num of md-cluster is from zero, and cinfo->slot_number represents the slot num of dlm, no need to check for equality. Signed-off-by: Guoqing Jiang <gqjiang@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-21md/cluster: Communication Framework: fix semicolon.cocci warningskbuild test robot
drivers/md/md-cluster.c:328:2-3: Unneeded semicolon Removes unneeded semicolon. Generated by: scripts/coccinelle/misc/semicolon.cocci Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-21md: recover_bitmaps() can be statickbuild test robot
drivers/md/md-cluster.c:190:6: sparse: symbol 'recover_bitmaps' was not declared. Should it be static? Signed-off-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-03-21md: Fix stray --cluster-confirm crashGoldwyn Rodrigues
A --cluster-confirm without an --add (by another node) can crash the kernel. Fix it by guarding it using a state. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: NeilBrown <neilb@suse.de>
2015-02-23Add new disk to clustered arrayGoldwyn Rodrigues
Algorithm: 1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues ioctl(ADD_NEW_DISC with disc.state set to MD_DISK_CLUSTER_ADD) 2. Node 1 sends NEWDISK with uuid and slot number 3. Other nodes issue kobject_uevent_env with uuid and slot number (Steps 4,5 could be a udev rule) 4. In userspace, the node searches for the disk, perhaps using blkid -t SUB_UUID="" 5. Other nodes issue either of the following depending on whether the disk was found: ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and disc.number set to slot number) ioctl(CLUSTERED_DISK_NACK) 6. Other nodes drop lock on no-new-devs (CR) if device is found 7. Node 1 attempts EX lock on no-new-devs 8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk as SpareLocal 9. If not (get no-new-dev lock), it fails the operation and sends METADATA_UPDATED 10. Other nodes understand if the device is added or not by reading the superblock again after receiving the METADATA_UPDATED message. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Suspend writes in RAID1 if within rangeGoldwyn Rodrigues
If there is a resync going on, all nodes must suspend writes to the range. This is recorded in the suspend_info/suspend_list. If there is an I/O within the ranges of any of the suspend_info, should_suspend will return 1. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Resync start/Finish actionsGoldwyn Rodrigues
When a RESYNC_START message arrives, the node removes the entry with the current slot number and adds the range to the suspend_list. Simlarly, when a RESYNC_FINISHED message is received, node clears entry with respect to the bitmap number. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Send RESYNCING while performing resync start/stopGoldwyn Rodrigues
When a resync is initiated, RESYNCING message is sent to all active nodes with the range (lo,hi). When the resync is over, a RESYNCING message is sent with (0,0). A high sector value of zero indicates that the resync is over. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Reload superblock if METADATA_UPDATED is receivedGoldwyn Rodrigues
Re-reads the devices by invalidating the cache. Since we don't write to faulty devices, this is detected using events recorded in the devices. If it is old as compared to the mddev mark it is faulty. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23metadata_update sends message to other nodesGoldwyn Rodrigues
- request to send a message - make changes to superblock - send messages telling everyone that the superblock has changed - other nodes all read the superblock - other nodes all ack the messages - updating node release the "I'm sending a message" resource. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Communication Framework: Sending functionsGoldwyn Rodrigues
The sending part is split in two functions to make sure atomicity of the operations, such as the MD superblock update. Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Communication Framework: ReceivingGoldwyn Rodrigues
1. receive status sender receiver receiver ACK:CR ACK:CR ACK:CR 2. sender get EX of TOKEN sender get EX of MESSAGE sender receiver receiver TOKEN:EX ACK:CR ACK:CR MESSAGE:EX ACK:CR 3. sender write LVB. sender down-convert MESSAGE from EX to CR sender try to get EX of ACK [ wait until all receiver has *processed* the MESSAGE ] [ triggered by bast of ACK ] receiver get CR of MESSAGE receiver read LVB receiver processes the message [ wait finish ] receiver release ACK sender receiver receiver TOKEN:EX MESSAGE:CR MESSAGE:CR MESSAGE:CR ACK:EX 4. sender down-convert ACK from EX to CR sender release MESSAGE sender release TOKEN receiver upconvert to EX of MESSAGE receiver get CR of ACK receiver release MESSAGE sender receiver receiver ACK:CR ACK:CR ACK:CR Signed-off-by: Lidong Zhong <lzhong@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Perform resync for cluster node failureGoldwyn Rodrigues
If bitmap_copy_slot returns hi>0, we need to perform resync. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Initiate recovery on node failureGoldwyn Rodrigues
The DLM informs us in case of node failure with the DLM slot number. cluster_info->recovery_map sets the bit corresponding to the slot number and wakes up the recovery thread. The recovery thread: 1. Derives the slot number from the recovery_map 2. Locks the bitmap corresponding to the slot 3. Copies the set bits to the node-local bitmap Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Gather on-going resync information of other nodesGoldwyn Rodrigues
When a node joins, it does not know of other nodes performing resync. So, each node keeps the resync information in it's LVB. When a new node joins, it reads the LVB of each "online" bitmap. [TODO] The new node attempts to get the PW lock on other bitmap, if it is successful, it reads the bitmap and performs the resync (if required) on it's behalf. If the node does not get the PW, it requests CR and reads the LVB for the resync information. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Lock bitmap while joining the clusterGoldwyn Rodrigues
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
2015-02-23Use separate bitmaps for each nodes in the clusterGoldwyn Rodrigues
On-disk format: 0 4k 8k 12k ------------------------------------------------------------------- | idle | md super | bm super [0] + bits | | bm bits[0, contd] | bm super[1] + bits | bm bits[1, contd] | | bm super[2] + bits | bm bits [2, contd] | bm super[3] + bits | | bm bits [3, contd] | | | Bitmap super has a field nodes, which defines the maximum number of nodes the device can use. While reading the bitmap super, if the cluster finds out that the number of nodes is > 0: 1. Requests the md-cluster module. 2. Calls md_cluster_ops->join(), which sets up clustering such as joining DLM lockspace. Since the first time, the first bitmap is read. After the call to the cluster_setup, the bitmap offset is adjusted and the superblock is re-read. This also ensures the bitmap is read the bitmap lock (when bitmap lock is introduced in later patches) Questions: 1. cluster name is repeated in all bitmap supers. Is that okay? Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>