Troubleshooting Harvest
A set of steps to go through when something goes wrong.
Run the following, replacing <poller> with the name of a poller from your harvest.yml:
./bin/harvest zapi -p <poller> show system
Copy and paste the output into your issue. Here's an example:
./bin/harvest zapi -p infinity show system
connected to infinity (NetApp Release 9.8P2: Tue Feb 16 03:49:46 UTC 2021)
[results] - *
[build-timestamp] - 1613447386
[is-clustered] - true
[version] - NetApp Release 9.8P2: Tue Feb 16 03:49:46 UTC 2021
[version-tuple] - *
[system-version-tuple] - *
[generation] - 9
[major] - 8
[minor] - 0
I tried to install and ...
You believe Harvest is installed fine, but it's not working.
- Post the contents of your harvest.yml
Try validating your harvest.yml with yamllint like so: yamllint -d relaxed harvest.yml
If you do not have yamllint installed, look here.
There should be no errors - warnings like the following are fine:
harvest.yml
64:1 warning too many blank lines (3 > 0) (empty-lines)
- How did you start Harvest?
- What do you see in /var/log/harvest/*?
- What does ps aux | grep poller show?
- If you are using Prometheus, try hitting Harvest's Prometheus endpoint like so:
curl http://machine-this-is-running-harvest:prometheus-port-in-harvest-yaml/metrics
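If the curl output is too long to judge by eye, a small script can count how many metric samples the endpoint returned. The following is a minimal sketch, not part of Harvest: it assumes the endpoint serves the standard Prometheus text format, and the metric name in the sample is made up for illustration.

```python
from urllib.request import urlopen


def count_samples(exposition_text):
    """Count metric sample lines, ignoring blank lines and '#' comment lines."""
    return sum(
        1
        for line in exposition_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )


def fetch_metrics(url, timeout=10):
    """Download the body of a /metrics endpoint as text."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Hypothetical sample output; real metric names depend on your collectors.
SAMPLE = """\
# HELP volume_read_ops example help text
# TYPE volume_read_ops gauge
volume_read_ops{volume="vol1"} 12
volume_read_ops{volume="vol2"} 7
"""

print(count_samples(SAMPLE))  # 2
```

Against a live poller you would call count_samples(fetch_metrics(url)) with the same URL you passed to curl. A count of zero suggests the poller is running but not exporting any data.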
Use the --debug flag when starting a poller. In debug mode, the poller will only collect metrics, but not write to databases. Another useful flag is --foreground, in which case all log messages are written to the terminal. Note that you can only start one poller in foreground mode.
Finally, you can use --loglevel=1 or --verbose, if you want to see a lot of log messages. For even more, you can use --loglevel=0 or --trace.
Example:
harvest start my_poller --foreground --debug --loglevel=0
which is equal to:
harvest start my_poller -fdt
See How do I start Harvest in debug mode?
Since a poller starts a large number of collectors (each collector-object pair is treated as a collector), it is often hard to find the message you are looking for among all the log output. When troubleshooting, it can therefore help to start a single collector-object pair. You can use the --collectors and --objects flags for that. For example, start only the ZapiPerf collector with the SystemNode object:
harvest start my_poller --collectors ZapiPerf --objects SystemNode
(To find the correct object name, check the conf/COLLECTOR/default.yaml file of the collector.)
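Looking up object names can also be scripted. The sketch below is only an illustration and rests on an assumption about the template layout: that default.yaml lists objects under a top-level objects: key as indented Name: file.yaml pairs. The sample text is made up; check your own conf/COLLECTOR/default.yaml before relying on it.

```python
def list_objects(template_text):
    """Return the names listed under a top-level 'objects:' key.

    Assumes a simple layout of indented 'Name: file.yaml' pairs.
    """
    names = []
    in_objects = False
    for raw in template_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments
        if not line.strip():
            continue
        if not line.startswith((" ", "\t")):
            # Any unindented line starts a new top-level key.
            in_objects = line.strip() == "objects:"
            continue
        if in_objects and ":" in line:
            names.append(line.split(":", 1)[0].strip())
    return names


# Hypothetical template excerpt, for illustration only.
SAMPLE = """\
collector: ZapiPerf
objects:
  # node-level objects
  SystemNode: system_node.yaml
  Volume: volume.yaml
schedule:
  - data: 60s
"""

print(list_objects(SAMPLE))  # ['SystemNode', 'Volume']
```

The returned names are the values you would pass to --objects.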
The logs show these errors:
context deadline exceeded (Client.Timeout or context cancellation while reading body)
and then for each volume
skipped instance [9c90facd-3730-48f1-b55c-afacc35c6dbe]: not found in cache
The error context deadline exceeded (Client.Timeout or context cancellation while reading body) means Harvest is timing out when talking to your cluster. This sometimes happens when you have a large number of resources (e.g. volumes).
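To see how often these errors occur, you can count them in the poller logs. The following is a minimal sketch that matches only the two message fragments quoted above. The helper name and the sample lines are illustrative, and real log lines carry extra fields such as timestamps.

```python
from collections import Counter

# Substrings taken from the error messages quoted in this section.
PATTERNS = {
    "timeout": "context deadline exceeded",
    "skipped_instance": "not found in cache",
}


def summarize(lines):
    """Count how many lines contain each known error fragment."""
    counts = Counter()
    for line in lines:
        for label, needle in PATTERNS.items():
            if needle in line:
                counts[label] += 1
    return dict(counts)


# Illustrative log lines built from the messages above.
SAMPLE_LINES = [
    "context deadline exceeded (Client.Timeout or context cancellation while reading body)",
    "skipped instance [9c90facd-3730-48f1-b55c-afacc35c6dbe]: not found in cache",
    "skipped instance [another-instance]: not found in cache",
    "some unrelated message",
]

print(summarize(SAMPLE_LINES))  # {'timeout': 1, 'skipped_instance': 2}
```

To run it on real logs, pass summarize() the lines of a file under /var/log/harvest/, for example via open(path).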
There are a few parameters that you can change to prevent this. You can do so by editing the subtemplate of the affected resource. E.g. you can add the parameters in conf/zapiperf/cdot/9.8.0/volume.yaml or conf/zapi/cdot/9.8.0/volume.yaml. If the errors happen for most of the resources, you can add the parameters to the main template of the collector (conf/zapi/default.yaml or conf/zapiperf/default.yaml) to apply them to all objects.
Increase the client_timeout value by adding a client_timeout line at the beginning of the template, like so:
# increase the timeout to 60 seconds
client_timeout: 60
Decrease the batch_size value by adding a batch_size line at the beginning of the template. The default value of this parameter is 500. By decreasing it, the collector will fetch fewer instances during each API request. Example:
# decrease number of instances to 200 for each API request
batch_size: 200
If nothing else helps, you can increase the data poll interval of the collector (default is 60s for ZapiPerf and 180s for Zapi). You can do this either by adding a schedule attribute to the template or, if it already exists, by changing the - data line.
Example for ZapiPerf:
# increase data poll interval to 2 minutes
schedule:
- counter: 1200s
- instance: 600s
- data: 120s
Example for Zapi:
# increase data poll interval to 5 minutes
schedule:
- instance: 600s
- data: 300s
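To put the batch_size setting discussed above in perspective, here is a rough back-of-the-envelope sketch. It assumes each API request returns at most batch_size instances, and the figure of 10,000 volumes is made up for illustration. A smaller batch means more requests per poll, but each one is smaller and less likely to hit the client timeout.

```python
import math


def requests_per_poll(instances, batch_size):
    """Number of API requests needed to fetch all instances once."""
    return math.ceil(instances / batch_size)


# Hypothetical cluster with 10,000 volumes.
for batch_size in (500, 200):
    print(batch_size, requests_per_poll(10_000, batch_size))
# 500 20
# 200 50
```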