by lunarg on December 9th 2019, at 16:54

I had an issue where vMotion would no longer work. When selecting the target host in the vCenter UI, the compatibility check would fail with the error:

Error
A general system error occurred: Connection refused: The remote service is not running OR is overloaded, OR a firewall is rejecting connections.

Background

A Google search for the error turns up quite a few possible causes, most of them fairly common ones, and in my case those all checked out fine. Digging deeper into the logs, I found the same message in /storage/log/vmware/vmware-vpxd/vpxd.log.

Continuing the search, I found someone mentioning that the error can be caused by services that have not started. This is easy to check: log in to the VCSA through SSH and run service-control --status --all from the command line. That someone was right:

# service-control --status --all
Stopped:
 vmcam vmware-imagebuilder vmware-mbcs vmware-netdumper vmware-rbd-watchdog vmware-sps vmware-statsmonitor vmware-updatemgr vmware-vcha vsan-dps
Running:
 ...

Note: output truncated for readability
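
The Stopped list is easier to compare against another appliance once it is split into one service name per line. A minimal sketch, assuming the status text arrives on stdout in the layout shown above; stopped_services and the file names in the comments are hypothetical, not part of service-control:

```shell
# Hypothetical helper: print the services listed under "Stopped:", one per line, sorted.
# Reads the text of `service-control --status --all` on stdin.
stopped_services() {
  awk '
    /^Stopped:/   { in_stopped = 1; next }   # start of the Stopped section
    /^[A-Za-z]+:/ { in_stopped = 0 }         # any other section header ends it
    in_stopped    { for (i = 1; i <= NF; i++) print $i }
  ' | sort
}

# On each appliance (file names are placeholders):
#   service-control --status --all | stopped_services > stopped-broken.txt
# Then compare the two lists:
#   diff stopped-healthy.txt stopped-broken.txt
```

Services that show up only in the list from the broken appliance are the ones worth a closer look.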

I compared the output to that of a "healthy" vCenter and noticed that some services (vmware-sps and vmware-updatemgr) had not started. Starting them manually did not help: the start process hung indefinitely. The log of the service (/storage/logs/vmware-sps/sps.log) revealed another error (again, I truncated the output):

/storage/logs/vmware-sps/sps.log
java.lang.IllegalStateException: Client initialization is not complete!

After some more searching, I came across this Reddit thread, which turned out to hold the solution:

https://www.reddit.com/r/vmware/comments/dumb15/certificates_were_updated_and_now_sps_and_update/

The thread describes an issue in the database: if a particular table (vpx_access) contains multiple entries for the SSO admin account, the vmware-sps service fails to start.

And sure enough, although nothing had been updated (no patches, no certificates), there were indeed multiple entries for the SSO admin in the database. After I removed the surplus rows and restarted all services, everything started properly and the issue was resolved.

Resolution

The steps I took to resolve the issue:

  1. Enable shell access
  2. Log on to the PostgreSQL database shell:
    /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
  3. Run the following query to check whether there is more than one entry for the SSO admin account:
    SELECT * FROM vpx_access;
    Look for the entries for the SSO admin (for example: VSPHERE.LOCAL\Administrator). You should get output similar to:
    VCDB=# SELECT * FROM vpx_access;
     id |          principal          | role_id | entity_id | flag | surr_key
    ----+-----------------------------+---------+-----------+------+----------
      1 | VSPHERE.LOCAL\Administrator |      -1 |         1 |    1 |        1
    (1 row)
    If you see more than one row for the SSO admin, you have found the problem. Every row except the one with id = 1 should be removed.
  4. Remove the extra rows where the principal = VSPHERE.LOCAL\Administrator, leaving only the one with id = 1. Be careful with the WHERE clause, because a DELETE without it removes every row in the table:
    DELETE FROM vpx_access WHERE principal = 'VSPHERE.LOCAL\Administrator' AND id <> 1;
  5. After running the query, re-run the SELECT statement to verify that only one row is left for the SSO admin.
  6. Restart all services:
    service-control --stop --all
    service-control --start --all
    It will take some time to stop the services and start them again. Alternatively, you can also reboot the appliance.
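
Steps 3 to 5 can also be wrapped in a single transaction, so that a mistyped WHERE clause can be rolled back instead of committed. This is only a sketch of that idea, not what I ran at the time: cleanup_sql is a hypothetical helper that merely prints SQL, and it assumes the same table, principal and id = 1 convention as above, plus backslashes being taken literally in the string (as the DELETE in step 4 also assumes).

```shell
# Hypothetical helper: print a transaction that removes surplus SSO admin rows.
# $1 = final statement: ROLLBACK (default, dry run) or COMMIT.
# $2 = principal, defaulting to the example account from step 3.
cleanup_sql() {
  final=${1:-ROLLBACK}
  principal=${2:-'VSPHERE.LOCAL\Administrator'}
  cat <<EOF
BEGIN;
DELETE FROM vpx_access WHERE principal = '$principal' AND id <> 1;
SELECT id, principal FROM vpx_access WHERE principal = '$principal';
$final;
EOF
}

# Dry run on the appliance (changes are rolled back at the end):
#   cleanup_sql | /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
# Apply for real once the SELECT shows exactly one row:
#   cleanup_sql COMMIT | /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
```

The dry run prints the number of deleted rows and the rows that would remain, without changing anything. Taking a snapshot or backup of the appliance before touching the database is a sensible extra precaution.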

After correcting the database, all services started without problems and functionality was restored.
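
To confirm the result after the restart without reading the whole status output, the check can be scripted. A rough sketch, assuming the status layout shown earlier; all_running is a hypothetical helper, not a service-control feature:

```shell
# Hypothetical helper: succeed only if every named service is listed under "Running:".
# Reads the text of `service-control --status --all` on stdin.
all_running() {
  running=$(awk '
    /^Running:/   { on = 1; next }   # start of the Running section
    /^[A-Za-z]+:/ { on = 0 }         # any other section header ends it
    on            { for (i = 1; i <= NF; i++) print $i }
  ')
  for svc in "$@"; do
    # -x and -F: match the whole line as a fixed string, not as a pattern
    printf '%s\n' "$running" | grep -qxF -- "$svc" || {
      echo "not running: $svc" >&2
      return 1
    }
  done
}

# On the appliance:
#   service-control --status --all | all_running vmware-sps vmware-updatemgr && echo OK
```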

Conclusion

The error message at the start of this article has many possible causes, so the solution presented here may not resolve it in your case, but it's worth checking before you move on to researching other possible causes.