Troubleshooting vGPU on XenServer

vGPU on XenServer is awesome technology, but as with any other software, you can run into issues. For example, a VM which has a vGPU profile assigned can fail to start with the dreaded “vgpu exited unexpectedly” error. To be more precise, the error can read:

Internal error: xenopsd internal error: Device.Ioemu_failed("vgpu exited unexpectedly")

In XenCenter, the log tab will display the following entry:

vGPU error displayed when starting VM

In this article, I’ll describe some steps to troubleshoot vGPU issues, and possible solutions to fix failing vGPU-enabled VMs.

XenServer Console

vGPU installation

The first thing to check is whether the vGPU installation is OK. A few commands are enough to verify this.

First of all, the NVIDIA Virtual GPU Manager RPM package should be installed on the XenServer host. You can check this either by opening a console connection using XenCenter, or by connecting over SSH using a tool like PuTTY. Once logged on as a user with administrative permissions (e.g. root), you can execute this command:

rpm --query --all | grep -i nvidia

This should output one line containing the NVIDIA Virtual GPU Manager version:

[root@localhost ~]# rpm --query --all | grep -i nvidia
NVIDIA-vgx-xenserver-6.2-340.57

Next up is to check if the NVIDIA kernel module is loaded on the XenServer. This can be done by using the following command (again on the server’s console, or using a tool like Putty):

lsmod | grep -i nvidia

This should output at least the following:

[root@localhost ~]# lsmod | grep -i nvidia
nvidia               9522927  14 
i2c_core               20294  2 nvidia,i2c_i801

Last step is to check if the NVIDIA System Management Interface runs. Again, in the XenServer’s console, execute the following command:

nvidia-smi

This should output something like the following:

[root@localhost ~]# nvidia-smi
Thu Jan 15 12:32:48 2015       
+------------------------------------------------------+                       
| NVIDIA-SMI 340.57     Driver Version: 340.57         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K2             On   | 0000:08:00.0     Off |                    0 |
| N/A   39C    P8    27W / 117W |     11MiB /  3583MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GRID K2             On   | 0000:09:00.0     Off |                    0 |
| N/A   39C    P8    27W / 117W |     11MiB /  3583MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|  No running compute processes found                                         |
+-----------------------------------------------------------------------------+

If one of these steps does not output something like the examples, there could be something wrong with the vGPU installation. Best would be to remove any vGPU installation and reinstall the latest version.
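The three checks above can be combined into one small script. This is only a minimal sketch: the function names (`has_vgpu_rpm`, `has_nvidia_module`, `report`, `check_vgpu_install`) are my own and not part of any NVIDIA or Citrix tooling, and the parsing assumes output that looks like the examples shown earlier.

```shell
#!/bin/bash
# Sketch: run the three vGPU installation checks in one go.
# The two parsers read command output on stdin, so they can be tried offline.

# Input: output of `rpm --query --all`
has_vgpu_rpm() { grep -qi 'nvidia'; }

# Input: output of `lsmod`; the module name is the first column
has_nvidia_module() { grep -qE '^nvidia[[:space:]]'; }

# report <label> <exit status>
report() {
    if [ "$2" -eq 0 ]; then echo "OK   $1"; else echo "FAIL $1"; fi
}

# Run this function on the XenServer host itself.
check_vgpu_install() {
    rpm --query --all 2>/dev/null | has_vgpu_rpm
    report "NVIDIA Virtual GPU Manager RPM installed" $?

    lsmod 2>/dev/null | has_nvidia_module
    report "nvidia kernel module loaded" $?

    nvidia-smi >/dev/null 2>&1
    report "nvidia-smi runs" $?
}
```

After sourcing the file on the host, calling `check_vgpu_install` prints one OK or FAIL line per check. A FAIL line only tells you which step to look at more closely, it does not replace reading the actual command output.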

Updating / reinstalling the vGPU driver is described in one of my other articles: Upgrading the NVIDIA GRID vGPU driver on XenServer.

Server / VM driver version

The driver installed inside the VM and the driver installed on the XenServer host itself should come from the same driver package downloaded from the NVIDIA website. If these versions do not match, you could be facing issues like failing VMs or failing display devices inside your VM.

If you look at the current (Jan 15th 2015) downloadable driver package for XenServer 6.2 on the NVIDIA website, the filename is “NVIDIA-GRID-vGPU-XenServer-6.2-340.57-341.08.zip”. This means version 340.57 for the XenServer VGX driver/Management Interface, and driver version 341.08 for your Windows VM.

To check which driver is being used on the XenServer itself, execute the following command from the XenServer’s console:

nvidia-smi --query | grep -i "driver version"

This would output the installed driver version on XenServer:

[root@localhost ~]# nvidia-smi --query | grep -i "driver version"
Driver Version                      : 340.57
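To compare the host driver with the package you downloaded, something like the following could help. It is a rough sketch under two assumptions: the package filename always ends in `-<host version>-<guest version>.zip` as in the example above, and `nvidia-smi --query` prints a “Driver Version” line in the format shown. The function names `package_versions` and `smi_driver_version` are hypothetical helpers.

```shell
#!/bin/bash
# Sketch: derive the expected versions from the package filename and
# read the installed host driver version from nvidia-smi output.

# package_versions <file> -> prints "<host version> <guest version>"
package_versions() {
    local base="${1##*/}"      # drop any directory part
    base="${base%.zip}"        # drop the extension
    local guest="${base##*-}"  # last dash-separated field
    local rest="${base%-*}"
    local host="${rest##*-}"   # second-to-last field
    echo "$host $guest"
}

# Input: output of `nvidia-smi --query`; prints the first driver version found
smi_driver_version() {
    awk -F': *' 'tolower($0) ~ /driver version/ {
        sub(/[ \t\r]+$/, "", $2)   # trim trailing whitespace
        print $2
        exit
    }'
}

# Example use on the host:
#   set -- $(package_versions NVIDIA-GRID-vGPU-XenServer-6.2-340.57-341.08.zip)
#   [ "$(nvidia-smi --query | smi_driver_version)" = "$1" ] && echo "host driver matches"
```

The guest version (the second value) still has to be compared by hand against the driver shown inside the Windows VM.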

Now, log in to the VM to check the display adapter and the installed driver. To check the display device, open the Device Manager and expand “Display adapters”:

Device Manager Display Adapters

Double click the NVIDIA GRID display adapter to open the device status:

NVIDIA GRID display adapter

In this case, the display adapter shows that it is working properly. To check the installed version of the NVIDIA GRID display adapter driver, open “Add/Remove Programs” or “Programs and Features” via the Control Panel. Search for “NVIDIA Graphics Driver”. This will display which driver version is installed in the VM itself:

NVIDIA GRID display driver

If the VM’s driver version does not match the XenServer’s driver version, one of them (or both :)) needs to be upgraded. Again, I’ve described how to do this in my article “Upgrading the NVIDIA GRID vGPU driver on XenServer”.

vGPU Configuration

Another possible error message in the VM’s log tab could be the following:

Jan 16, 2015 11:34:33 AM Error: Starting VM 'xxx' - vGPU type is not compatible with one or more of the vGPU types currently running on this pGPU

The message can be a bit confusing, but it usually means one of two things:

  1. There are no physical GPUs available at all to start this VM
  2. There is room on one of the GPUs, but mixed-vGPU-profile configuration prevents the VM from starting

vGPU capacity

To check if there are GPUs available to host the vGPU-enabled virtual machine, you can just open the GPU tab in XenCenter:

XenServer GPU tab to check available capacity

In this picture, you can see that all GPUs are in use. Any further VM that is started will fail with the “vGPU type is not compatible with one or more of the vGPU types currently running on this pGPU” error message.

In this case, possible solutions are:

  • Add another GPU (GRID card) to the physical server
  • Lower the number of VMs per-host
  • Change the vGPU profile to a “lighter” profile (e.g. from K260Q to K240Q) to allow more VMs per GPU

NOTE ABOUT MCS (Machine Creation Services) in XenDesktop: if you don’t have enough GPUs available and you’re creating a new machine catalog for VMs which have vGPU enabled, the creation will fail. Part of the MCS process is to boot the golden image once. If there’s no room on the GPU, booting the image will fail and the MCS catalog creation will roll back. So before creating your machine catalog with MCS, make sure you have enough GPU capacity.

Mixed vGPU profiles

If you have VMs with different configured vGPU profiles, it could happen that you also run into the “vGPU type is not compatible with one or more of the vGPU types currently running on this pGPU” error message.

XenCenter GPU Capacity with mixed vGPU profiles

As seen above in the GPU tab of XenCenter, there is still a lot of GPU capacity available. The problem here is the fact that one GPU can’t host mixed-vGPU-profiles. In other words: one physical GPU will only host one type of vGPU profile. In this case, the first GPU is hosting a VM with a K220Q profile, while the second GPU is hosting 2 VMs with a K200 profile. If the next VM which is started uses a vGPU profile other than a K200 or K220Q profile, it will not be able to start.

The maximum number of different profiles you can use is equal to the total number of physical GPUs in your host (not graphics cards / GRID boards, but the physical GPUs on the boards themselves).
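The one-profile-per-GPU rule can be illustrated with a tiny simulation. This is purely a sketch of the rule as described above, not of how XenServer implements placement: the function `can_start` is hypothetical, it uses “-” to mark an empty GPU, and it deliberately ignores capacity (how many VMs of a profile fit on one GPU).

```shell
#!/bin/bash
# Sketch: can a VM with a given vGPU profile start, given which profile
# each physical GPU is currently hosting?
#
# can_start <wanted profile> <profile on GPU 0> <profile on GPU 1> ...
# Use "-" for a GPU that hosts no VMs. Capacity is NOT modelled.
can_start() {
    local wanted="$1"
    shift
    local current
    for current in "$@"; do
        # A GPU is usable if it is empty or already hosts the same profile.
        if [ "$current" = "-" ] || [ "$current" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# The situation from the screenshot: GPU 0 hosts K220Q, GPU 1 hosts K200.
#   can_start K200  K220Q K200   -> succeeds (GPU 1 already hosts K200)
#   can_start K240Q K220Q K200   -> fails (no empty GPU, no matching profile)
```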

Again, possible solutions:

  • Add another GPU (GRID card) to the physical server to accommodate more vGPU profiles
  • Lower the number of different vGPU profiles

Placement Policy and Profiles

Continuing on the previous chapter (mixed vGPU profiles), there are 2 configuration options in XenCenter which could influence the total number of different vGPU profiles per GPU:

  1. Placement policy
  2. Allowed vGPU types

By default, the settings for both of these configuration options allow the maximum number of vGPU profiles per GPU. But if these configurations have been modified, you can run into errors.

First of all, the placement policy. This tells XenServer on which GPU the VM should be placed if it’s started. This setting can be modified by opening the host properties and opening the “GPU” tab:

XenServer GPU Placement Policy

  • Maximum density
    Every newly started VM is placed on the GPU which already hosts the same vGPU profile (if applicable). If there’s no GPU which is hosting the same vGPU profile, it will be started on an empty GPU. This is the default configuration.
  • Maximum performance
    Every newly started VM is placed on an empty GPU if possible. This configuration does allow for the best performance, but if you’re using multiple vGPU profiles, you can run into issues.
    Simple example: if 2 VMs are started with the same K220Q profile on a host with two GPUs, each of them will claim one GPU. If the next VM that starts doesn’t use a K220Q profile, it can’t be assigned to any GPU and the VM can’t be started.

So if you configured “Maximum performance” and you’re getting errors, check if you can change the placement policy to “Maximum density“.

Next is the “Allowed vGPU types” configuration. This feature can be accessed from the GPU tab in XenCenter. Again, the default configuration allows all types of vGPU profiles on every GPU (if compatible, of course :)). But if you modified these settings, e.g. to only allow K260Q profiles on the GPUs, a VM with a K240Q vGPU profile will not be able to start.

XenCenter allowed vGPU profiles per GPU

If you look at the example above, you can see that on the selected GPU only VMs with a K200 vGPU profile are allowed to run. Any VM with another vGPU profile will not start on this GPU. To fix this, make sure that every vGPU profile you use is allowed on at least one GPU.

Memory

ECC /  Non-ECC memory

If you have a host which has ECC memory installed, the NVIDIA System Management Interface needs to be configured for ECC memory support (and vice versa). If there’s a mismatch in the ECC support configuration, your vGPU-enabled VMs can fail with a “vgpu exited unexpectedly” error.

You can check if ECC support is disabled or enabled by using the following command:

nvidia-smi --query | more

Look for the ECC mode part:

    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled

To change the ECC mode of the NVIDIA System Management Interface, use the “--ecc-config” parameter in the nvidia-smi command. To enable ECC memory support, use:

nvidia-smi --ecc-config=1

To disable ECC memory support (you probably guessed it :)):

nvidia-smi --ecc-config=0

After running this command, a reboot of your XenServer is needed.
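If you don’t want to page through the full query output, the ECC mode can be picked out with a small filter. This is a sketch that assumes the “Ecc Mode / Current / Pending” layout shown above; the function name `ecc_modes` is my own.

```shell
#!/bin/bash
# Sketch: print "<current> <pending>" ECC mode, one line per GPU.
# Input: output of `nvidia-smi --query` on stdin.
ecc_modes() {
    awk -F': *' '
        /Ecc Mode/ { in_ecc = 1; next }      # start of the ECC block
        in_ecc && /Current/ {
            sub(/[ \t\r]+$/, "", $2)         # trim trailing whitespace
            cur = $2
        }
        in_ecc && /Pending/ {
            sub(/[ \t\r]+$/, "", $2)
            print cur, $2
            in_ecc = 0                       # block is complete
        }
    '
}

# Example use on the host:
#   nvidia-smi --query | ecc_modes
```

If the two values on a line differ (for example “Enabled Disabled”), the change made with “--ecc-config” has been recorded but the host has not been rebooted yet.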

Dynamic VM memory

XenServer has a nice feature which allows you to save memory. You can configure a memory range which should be allocated for a VM (minimum and maximum). This feature will allocate the maximum amount of memory at first, but will free host memory when possible.

XenCenter Dynamic Memory

Long story short, dynamic memory and vGPU don’t mix. Dane Young (twitter @youngtech) has written an excellent article about it on his blog: XenServer 6.2 Dynamic Memory and NVIDIA GRID vGPU, Don’t Do It! Note that the VM can start without issues, but you could run into problems if dynamic memory is enabled (read Dane’s article :)).

Additional Update: both Dane Young and Thomas Poppelgaard informed me that Dynamic Memory and vGPU do work together in XenServer 6.5. So if you’re using XenServer 6.5, you can ignore this chapter. For more information, read Dane’s blog: XenServer 6.5 Dynamic Memory and NVIDIA GRID vGPU? Now Fixed in 6.5! Go For It!

Additional checks

Checking logfiles

Basically, the most important logfile to check for errors is “/var/log/messages”. For example, if the ECC support is configured incorrectly, you will find a line in the “messages” logfile containing something like this:

Jan 19 16:09:26 localhost fe: vgpu-1[10160]: vmiop_log: error: Initialization: VGX not suported with ECC Enabled.

In the “messages” logfile, look for lines that start with “vmiop”:

grep -i "vmiop" /var/log/messages
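On a busy host that grep can return a lot of lines. To narrow it down to errors only, a small helper like the one below could be used. It is a sketch: the name `vmiop_errors` is hypothetical, and the pattern “vmiop_log: error” is taken from the single example line above, so other error formats may exist that it does not catch.

```shell
#!/bin/bash
# Sketch: show only vmiop error lines from a logfile.
# vmiop_errors [logfile]   (defaults to /var/log/messages)
vmiop_errors() {
    local logfile="${1:-/var/log/messages}"
    grep -i 'vmiop_log: error' "$logfile"
}

# Example use on the host, showing the 20 most recent errors:
#   vmiop_errors | tail -n 20
```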

Bug report

You can search and open the logfiles mentioned individually. If you want to have a file with all relevant logfiles and output consolidated, you can use the bug report command:

nvidia-bug-report.sh

The command gathers all relevant logfiles, command output and system information into one compressed file called “nvidia-bug-report.log.gz”, and stores it in the directory where you execute the command. You can use something like WinSCP to download it, extract it using e.g. 7-Zip, and analyze the logfile. This file contains a lot of valuable information about your environment.

If everything fails…

If none of this helps, you could be facing a hardware issue. The best next step would be to call either Citrix or NVIDIA support! 🙂

I hope this article was of use to you. Feel free to contact me through email, or leave a comment below.
