This article covers what happens when things go wrong.

Backup and Recovery

Managing backup and recovery of the StifleR server is very simple.

Backup

There are only a few files that require backup on the StifleR server.

From the StifleR server installation folder:

  • The main service configuration file (StifleR.Service.exe.config)

From the ProgramData\StifleR folder:

  • LocationData.xml

  • JobsData.xml

Because these files are small, it is worthwhile to back them up locally several times a day and to a remote location at least once daily.
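As an illustration only, the local backup could be scripted along the following lines. This is a minimal sketch, not a supported tool: the function name backup_stifler_files and its parameters are hypothetical, and you must supply the installation folder, the ProgramData\StifleR folder, and the backup destination for your own environment. Only the three file names come from this article.

```python
import shutil
from datetime import datetime
from pathlib import Path

# File names taken from the backup list above.
CONFIG_FILE = "StifleR.Service.exe.config"
DATA_FILES = ("LocationData.xml", "JobsData.xml")


def backup_stifler_files(install_dir, data_dir, dest_root, now=None):
    """Copy the three files into a new timestamped folder under dest_root.

    install_dir, data_dir and dest_root are supplied by the caller; this
    sketch makes no assumption about where StifleR is installed.
    """
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = Path(dest_root) / stamp
    # Fail rather than silently overwrite an existing backup folder.
    target.mkdir(parents=True, exist_ok=False)

    sources = [Path(install_dir) / CONFIG_FILE]
    sources += [Path(data_dir) / name for name in DATA_FILES]

    copied = []
    for src in sources:
        # copy2 also preserves the file modification time.
        copied.append(Path(shutil.copy2(src, target / src.name)))
    return copied
```

A scheduled task could call this function several times a day for the local copy, and a separate daily job could copy the newest timestamped folder to the remote location.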

Recovery

Using the backed-up files, the entire StifleR server can be reinstalled in a couple of minutes.

  1. Build a new server

  2. Install the latest version of StifleR

  3. Stop the StifleR Service

  4. Make a backup copy of the newly installed configuration file, then merge the settings from your backed-up configuration file into it.

  5. Check the installation notes on version upgrades and modify data if needed per the new version installation guide.

  6. Create new copies of LocationData.xml and JobsData.xml from your backup versions, with any modifications required.
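The last step can be sketched as a small script. This is an assumption-laden example rather than part of the product: restore_data_files and the ".installed" suffix are hypothetical names, both folders are passed in by the caller, and it should only be run while the StifleR service is stopped. It does not merge the configuration file or apply version-specific modifications; those remain manual steps.

```python
import shutil
from pathlib import Path

# Data file names taken from the backup list earlier in this article.
DATA_FILES = ("LocationData.xml", "JobsData.xml")


def restore_data_files(backup_dir, data_dir, keep_suffix=".installed"):
    """Copy the backed-up data files into data_dir.

    Any file already present in data_dir is renamed with keep_suffix
    (a hypothetical naming choice) instead of being overwritten.
    """
    backup_dir = Path(backup_dir)
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)

    restored = []
    for name in DATA_FILES:
        src = backup_dir / name
        dst = data_dir / name
        if dst.exists():
            # Keep the freshly installed file so it can be compared later.
            dst.replace(dst.with_name(dst.name + keep_suffix))
        shutil.copy2(src, dst)
        restored.append(dst)
    return restored
```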

Client Exits or Service Is Stopped

If the client is in connected mode and is the Red Leader, it will inform the server that it is disconnecting, remove the Red Leader role, and close its SignalR connection to the server. It will then reset the BITS and DO policy values to whatever is configured in the DefaultDisconnectedBITSPolicy and DefaultDisconnectedDOPolicy values in the StifleR.ClientApp.exe.config file.

If the value is set to 0, all downloads will be stopped. If it is set to 18446744073709551615 (the ulong maximum value), the policy will be removed completely. By default, the value is set to 64, which configures a Maintenance policy that allows the client to use 64 Kb/s. Therefore, in the case of total server loss, clients will use 64 Kb/s each. The effective figure is likely to be lower, because clients fall back to regular BITS behavior, which copes fairly well with multiple concurrent downloads as long as they do not start at the same time.
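The three cases can be summarized in a few lines of code. This is a sketch for illustration only; describe_disconnected_policy is a hypothetical helper and not part of the StifleR client.

```python
ULONG_MAX = 2**64 - 1  # 18446744073709551615, the ulong maximum value


def describe_disconnected_policy(value):
    """Describe the effect of a disconnected policy value, per the text above."""
    if not 0 <= value <= ULONG_MAX:
        raise ValueError("value must fit in an unsigned 64-bit integer")
    if value == 0:
        return "stop all downloads"
    if value == ULONG_MAX:
        return "remove the policy"
    return f"limit to {value} Kb/s"
```

For example, describe_disconnected_policy(64) describes the default setting.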

If the client is in standalone mode, changes to the policy are never made and therefore there is nothing to reset.

You can also track non-running services via the event log, triggering an action if the client is not running.

Bad Leader Detection

The solution builds on two-way communication between the Red Leader and the Server. If traffic is flowing only one way, actions are triggered on the server to replace the Red Leader.

This relies on the fact that the Red Leader must accept its nomination and return an acceptance to the server. Once the Leader has accepted the nomination, the following are set to true on the Client instance on the Server:

  • ReportedRedLeader - Boolean

  • ReportedBlueLeader - Boolean

If the Server detects ReportedRedLeader=False, the Red Leader is replaced. The Server detects unresponsive Red Leaders within a configurable time and flags the Leader as bad. This can happen when the Client disconnects or has stopped reporting.

The client will detect that the server is no longer communicating with it and un-assigns itself. The Server doesn’t care as it will have already assigned another Red Leader.
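The detection logic can be pictured roughly as follows. This is a simplified model under stated assumptions, not the Server's actual implementation: ClientRecord, is_bad_leader, pick_replacement, and the timeout parameter are all hypothetical, and only the ReportedRedLeader flag and the idea of a configurable time come from the description above.

```python
from dataclasses import dataclass


@dataclass
class ClientRecord:
    """Hypothetical server-side view of one client."""
    client_id: str
    reported_red_leader: bool = False  # models ReportedRedLeader
    last_report: float = 0.0           # time of the last report, in seconds


def is_bad_leader(leader, now, timeout_seconds):
    """A leader is bad if it never accepted or has stopped reporting."""
    if not leader.reported_red_leader:
        return True
    return (now - leader.last_report) > timeout_seconds


def pick_replacement(leader, clients, now, timeout_seconds):
    """Return the first other client that reported recently, or None."""
    for candidate in clients:
        if candidate.client_id == leader.client_id:
            continue
        if (now - candidate.last_report) <= timeout_seconds:
            return candidate
    return None
```

In this model the replacement is only a nominee. As described above, it would still have to accept the nomination before its ReportedRedLeader flag becomes true.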

Client Can’t Connect to Server (network issue etc.)

If the client cannot connect to the server at all, because of network roaming or other issues, it will try to connect to the next server in the list. Failing to connect to a valid server causes the Client to stall and wait until it can detect a valid Windows Network ID.

Client Crashes

If the client crashes, it may or may not be able to reset its BITS policy to the configured value. It is best to allow the service to restart automatically and correct the wrong value. At service startup, it will reset its BITS policy to the default low value.

Server Crashes

In the case of a complete Server loss, all clients are disconnected and try to reconnect to the server that they were connected to. You can set the client config value SleepBeforeSwitchingServer to make the client wait before trying the last known good server.

The client will wait for a random period of between 1 minute (configurable via SleepBeforeSwitchingServer) and 5 minutes (not configurable) before reconnecting.

If the last known server is not reachable, the next server in the Client’s configuration list will be tried. To avoid overwhelming the new server, the clients wait for between 1 minute (SleepBeforeSwitchingServer) and 5 minutes (hard limit) before connecting to it.
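The randomized wait and the server order can be sketched as below. This is an illustration under assumptions, not the client's real code: both function names are hypothetical, SleepBeforeSwitchingServer is treated as a value in minutes as the text suggests, clamping a larger value to the 5-minute limit is a guess, and so is wrapping around the list after the last known good server.

```python
import random

HARD_LIMIT_MINUTES = 5.0  # upper bound described as not configurable


def reconnect_delay_minutes(sleep_before_switching=1.0, rng=random):
    """Pick a random wait between the configured minimum and the hard limit."""
    # Assumption: a configured minimum above the hard limit is clamped to it.
    lower = min(float(sleep_before_switching), HARD_LIMIT_MINUTES)
    return rng.uniform(lower, HARD_LIMIT_MINUTES)


def server_try_order(servers, last_known_good):
    """Try the last known good server first, then continue through the list."""
    if last_known_good not in servers:
        return list(servers)
    start = servers.index(last_known_good)
    # Assumption: after the end of the list, wrap around to the beginning.
    return list(servers[start:]) + list(servers[:start])
```

Because each client draws its own delay, reconnections are spread over several minutes rather than arriving at the new server all at once.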

When clients run in (Legacy) Event-driven mode and the first server in the list is down, downloads may progress very slowly, because the second server might not be tried until five minutes have passed.

If you don’t have a warm standby server, the clients will still wait between one and five minutes before reconnecting to the server. In the meantime, all Red Leaders will un-assign themselves, and all clients and leaders will apply the DefaultDisconnectedBITSPolicy and DefaultDisconnectedDOPolicy values (64 Kb/s by default).

Once the server is back up and running, the clients will reconnect, configure, and resume their regular tasks as before.