ABD hang on boot
Categories: Troubleshooting, XenServer, Alike v4
When you run any Enhanced (ABD-based) job in XenServer, the job fails to complete and shows two distinct behaviors:
- Failure to begin the backup process, usually giving the error:
“ABD not ready. Failed to connect to xx.xx.xx.xx after 300 seconds”
- The temporary ABD that is provisioned (and later removed) shows no output on its console in XenCenter, and its CPU usage stays at 100% until it is removed.
Both behaviors (especially the second) must be present to fit the profile of this issue. The “ABD not ready” error can occur on its own in many situations, usually when a networking problem prevents the Alike server from connecting to the ABD over SSH.
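Because the “ABD not ready” error is often just a networking problem, a quick reachability test from the Alike server can help narrow things down. The sketch below is only an illustration: the function name is hypothetical, and it assumes the ABD listens for SSH on the default TCP port 22. A successful connection suggests the ABD booted and the cause lies elsewhere. A failed connection does not by itself distinguish a network fault from the boot hang, so the console and CPU checks are still required.

```python
import socket


def ssh_port_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: the ABD accepts SSH on port 22 (the SSH default). This only
    tests TCP reachability; it does not log in or run anything on the ABD.
    """
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

For example, calling `ssh_port_reachable(abd_ip)` with the address shown in the error message, while the job is still running, reports whether anything is answering on that port.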
To determine if you are experiencing this issue, please perform the following steps:
- Run an “ABD Diagnostic” job from Alike (under Tools->Manage ABDs)
- Wait for the job to time out (~5 minutes) with the following errors:
- While the ABD is stalled, check XenCenter to confirm that the ABD’s CPU usage is 100% (please note: ABDs are hidden by default in XenCenter)
- Look at the XenCenter “console” tab for the affected ABD (while the job is still running). The ABD’s console must be blank (all black or grey, with no text)
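The checklist above amounts to a simple rule: every symptom must be present at the same time. A minimal sketch of that rule follows, assuming the values are read by hand from the Alike job log and from XenCenter. The function name, parameter names, and the 99% threshold are illustrative choices, not part of Alike or XenServer.

```python
from typing import Sequence


def matches_hang_profile(
    abd_not_ready: bool,
    cpu_samples: Sequence[float],
    console_blank: bool,
    cpu_threshold: float = 99.0,
) -> bool:
    """Return True only when every symptom from the checklist is present.

    abd_not_ready: the job timed out with the "ABD not ready" error.
    cpu_samples:   CPU usage readings (percent) taken while the ABD is stalled.
    console_blank: the XenCenter console tab shows no text at all.
    cpu_threshold: hypothetical cutoff for treating a reading as "100%".
    """
    # CPU must be pegged for every reading; no readings means "not confirmed".
    cpu_pegged = len(cpu_samples) > 0 and all(
        sample >= cpu_threshold for sample in cpu_samples
    )
    return abd_not_ready and cpu_pegged and console_blank
```

With this rule, an “ABD not ready” error accompanied by a console that shows boot text, or by CPU usage that drops below the threshold, does not fit the profile and points to a different cause such as networking.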
The exact cause of this behavior is unknown. It is an error that XenServer experiences while handling PV (paravirtualized) drivers for its guest VMs. Specifically, the serial/console guest PV driver (hvc0) appears to cause the problem. It is important to note that this behavior (hang on boot with 100% CPU usage) will affect any fully PV guest that attempts to load the hvc0 driver.
Because this problem is extremely rare and stems from an underlying XenServer or hardware issue, the workaround will not be folded into the mainline release of Alike.
If you have followed the steps above and are experiencing this issue, please contact Quadric Support for access to a private workaround update.