Troubleshooting Backend Failures
This document outlines possible areas of failure and how they impact users
etcd is used by API and Core to determine the addresses of running Core services. As well as to determine where a module is executing in a cluster.
- Some drivers may not function as the API won't be able to reach them
- Core instances won't be able to join the cluster and won't launch drivers
- Most API requests will continue to work, execute and WebSocket requests will fail
- Triggers will fail
RethinkDB holds cluster configuration, primarily used by core during initialization.
- Most API requests will stop functioning, including authentication
- Existing WebSocket requests will continue to function
Redis holds the runtime state of the cluster, such as module metadata and module status
- Status changes and signals will stop functioning
- Errors raised in modules may prevent some execute requests from functioning properly
- Execute APIs and cross driver communication will be effected as module metadata will be unavailable
- API index / listing / searching requests will fail - this will be mostly apparent when using Backoffice
- All other aspects of the system will continue functioning
- Executes directed at a module running on the driver will fail
In the event of a failure, being able to isolate which aspect of the system is not functioning is key to a quick recovery.
- Can you log-in? If not it's probably a RethinkDB Failure or load balancer issue (check if requests are hitting the services)
- Does Backoffice list the systems? If not it's probably an Elasticsearch failure
If you can login and see systems:
- Select a system you can safely use for testing
- Does the list of modules and module functions load? If not it's probably a Redis Failure
- Does executing a function work from Backoffice? View the response to see the error:
- If the error says
no core instancesthen core might be down or unable to connect to etcd
- If the error says
unable to connect to etcdthen etcd might be down or the API can't connect to etcd
This should help you identify the cause of most issues.
All the services support being toggled to output more verbose logs which might help shed light on an issue.
To toggle log levels you send a signal to the service:
kill -USR1 <process_pid>
As you typically cannot shell into the containers, you'll have to run this command from the host. You must be running as root and you can find the process pid by running:
ps aux | grep <staff|api|triggers|etc>