A comprehensive CheckMK plugin for monitoring SMART error counters on storage devices. This plugin collects detailed error statistics from drives and provides threshold-based alerting and graphing.
- Comprehensive Error Monitoring: Tracks corrected and uncorrected errors across read, write, and verify operations
- Detailed Metrics: Provides both absolute error counts and error rates per TB processed
- Flexible Thresholds: Configure warning and critical levels for different error types
- Rich Graphing: Six detailed graphs showing error trends and statistics
- Agent Bakery Support: Automatic deployment and configuration via CheckMK GUI
- Linux Support: Works on Linux systems with smartmontools installed
- CheckMK 2.3.0 or later
- smartmontools package installed on target hosts
- Storage devices supporting SCSI error counter logs (most enterprise drives)
Get the latest version from the GitHub releases page.
Each release includes:
- Pre-built MKP package (recommended)
- Source code
- Release notes with changelog
The easiest way to install the plugin is using the pre-built MKP package:
- Download the MKP package from the releases page
- Install via CheckMK GUI:
  - Navigate to Setup → Extension packages
  - Click Upload package
  - Select the downloaded oposs_smart_error-X.Y.Z.mkp file
  - Click Install
- Or install via command line:
  mkp add oposs_smart_error-X.Y.Z.mkp
If you prefer to install from source, copy the plugin files to your CheckMK site:
cp -r local/* ~/local
The directory structure should look like:
~/local/lib/python3/cmk_addons/plugins/oposs_smart_error/
├── agent_based/
│   └── oposs_smart_error.py
├── checkman/
│   └── oposs_smart_error                    # Plugin documentation
├── graphing/
│   └── oposs_smart_error.py
└── rulesets/
    ├── oposs_smart_error.py                 # Check parameter rules
    └── ruleset_oposs_smart_error_bakery.py  # Bakery configuration rules
~/local/lib/python3/cmk/base/cee/plugins/bakery/
└── oposs_smart_error.py                     # Bakery plugin logic
~/local/share/check_mk/agents/plugins/
└── oposs_smart_error                        # Agent plugin script
After installing the MKP package or manually copying files:
- Restart CheckMK to load the new plugins:
  omd restart apache
- Agent Plugin Deployment:
  - With MKP installation: The agent plugin is automatically available for bakery deployment
  - For manual deployment: Copy the agent plugin to target hosts and make it executable:
    cp local/share/check_mk/agents/plugins/oposs_smart_error /usr/lib/check_mk_agent/plugins/
    chmod +x /usr/lib/check_mk_agent/plugins/oposs_smart_error
  - For Agent Bakery deployment (recommended): The bakery will automatically deploy the plugin
- Go to Setup > Agents > Agent rules
- Create a new rule for OPOSS SMART Error Monitoring (Linux)
- Configure options:
  - Enable: Enable/disable monitoring
  - Timeout: Command timeout in seconds (default: 30)
  - Interval: Execution interval in seconds (0 = every agent run)
- Assign the rule to target hosts
- Go to Setup > Agents > Agent bakery and bake agents
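To make the Interval option more concrete, here is a minimal sketch of where a baked plugin would be expected to end up on the target host. It assumes the plugin follows the common Checkmk convention of placing asynchronously executed plugins in a subdirectory named after the interval in seconds; the `plugin_path` helper is a hypothetical name used only for this illustration and is not part of the plugin.

```python
from pathlib import PurePosixPath

# Default agent plugin directory on Linux hosts, as used elsewhere in this README.
PLUGIN_DIR = PurePosixPath("/usr/lib/check_mk_agent/plugins")


def plugin_path(name: str, interval: int = 0) -> PurePosixPath:
    """Return the expected plugin location for a given execution interval.

    Assumption: interval 0 means "run on every agent call" (plugin sits
    directly in the plugin directory); any other value places the plugin
    in a subdirectory named after the interval in seconds.
    """
    if interval < 0:
        raise ValueError("interval must not be negative")
    if interval == 0:
        return PLUGIN_DIR / name
    return PLUGIN_DIR / str(interval) / name


if __name__ == "__main__":
    print(plugin_path("oposs_smart_error"))
    print(plugin_path("oposs_smart_error", 300))
```

If the plugin does not show up where expected after baking, comparing the actual location on the host against this convention can be a quick first check.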
If not using Agent Bakery, manually copy the agent plugin to target hosts:
scp local/share/check_mk/agents/plugins/oposs_smart_error root@target-host:/usr/lib/check_mk_agent/plugins/
ssh root@target-host chmod +x /usr/lib/check_mk_agent/plugins/oposs_smart_error
- Go to Setup > Hosts
- Select target hosts and run Service discovery
- SMART services will be discovered for each drive with error counter logs
- Accept the discovered services
- Go to Setup > Services > Service monitoring rules
- Create a new rule for OPOSS SMART Error Monitoring
- Configure thresholds for specific error counter types:
- Uncorrected Errors (Absolute): Warning/Critical for uncorrected errors (default: 1, 10)
- ECC Fast Corrected Errors (Absolute): Warning/Critical for fast ECC corrections (default: 10000, 100000)
- ECC Delayed Corrected Errors (Absolute): Warning/Critical for delayed ECC corrections (default: 1000, 10000)
- Rereads/Rewrites Corrected Errors (Absolute): Warning/Critical for reread/rewrite corrections (default: 100, 1000)
- Correction Algorithm Invocations (Absolute): Warning/Critical for algorithm invocations (default: 50000, 500000)
- Uncorrected Errors per TB: Warning/Critical for uncorrected error rate per TB (default: 0.1, 1.0)
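As a rough illustration of how warning/critical pairs like these translate into a service state, the sketch below compares a counter value against the default levels listed above. The dictionary keys and the `evaluate` function are hypothetical names chosen for this example, not the plugin's actual parameter names, and the "greater than or equal" comparison is an assumption based on how upper levels usually behave in Checkmk.

```python
from enum import IntEnum


class State(IntEnum):
    OK = 0
    WARN = 1
    CRIT = 2


# Default (warning, critical) pairs as listed above.
# The keys are hypothetical names used only for this illustration.
DEFAULT_LEVELS = {
    "uncorrected_errors": (1, 10),
    "ecc_fast": (10_000, 100_000),
    "ecc_delayed": (1_000, 10_000),
    "rereads_rewrites": (100, 1_000),
    "algorithm_invocations": (50_000, 500_000),
    "uncorrected_errors_per_tb": (0.1, 1.0),
}


def evaluate(counter: str, value: float, levels=DEFAULT_LEVELS) -> State:
    """Map a counter value to a state, assuming upper levels with >= comparison."""
    warn, crit = levels[counter]
    if value >= crit:
        return State.CRIT
    if value >= warn:
        return State.WARN
    return State.OK


if __name__ == "__main__":
    print(evaluate("uncorrected_errors", 0).name)   # OK
    print(evaluate("ecc_delayed", 1500).name)       # WARN
    print(evaluate("uncorrected_errors", 10).name)  # CRIT
```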
Each monitored drive will show:
- Device model, capacity, and serial number
- Breakdown of specific error correction types (ECC fast, ECC delayed, rereads/rewrites)
- Uncorrected errors count
- Amount of data processed
- Current service state based on configured thresholds
The service name format is "OPOSS SMART Errors <device_path>", for example "OPOSS SMART Errors /dev/sda".
Example service output:
WDC WD4003FZEX-00Z4SA0 (4.00 TB) S/N: WD-WMC130: Corrected: 125 ECC fast, 31 ECC delayed, 12.3 TB processed
The plugin provides detailed metrics for each operation type (read, write, verify):
- Errors corrected by ECC fast
- Errors corrected by ECC delayed
- Errors corrected by rereads/rewrites
- Total errors corrected
- Correction algorithm invocations
- Bytes processed
- Total uncorrected errors
- All above metrics calculated per TB of data processed
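The per-TB metrics can be pictured with a small sketch like the one below. It assumes 1 TB means 10^12 bytes and that a drive with no data processed reports a rate of zero; both are assumptions made for this illustration, and `errors_per_tb` is a hypothetical helper rather than the plugin's own function.

```python
BYTES_PER_TB = 1e12  # assumption: decimal terabytes (10^12 bytes)


def errors_per_tb(error_count: int, bytes_processed: int) -> float:
    """Return the number of errors per terabyte of data processed.

    Returns 0.0 when no data has been processed, to avoid dividing by zero.
    """
    if bytes_processed <= 0:
        return 0.0
    return error_count / (bytes_processed / BYTES_PER_TB)


if __name__ == "__main__":
    # 125 corrected errors over 12.3 TB processed
    rate = errors_per_tb(125, 12_300_000_000_000)
    print(f"{rate:.2f} errors/TB")
```

Normalizing by data volume makes drives with very different workloads comparable: a heavily used drive with many corrected errors may still have a lower rate than a lightly used drive with only a few.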
- Read Operations - Error Counts: Absolute read error metrics
- Write Operations - Error Counts: Absolute write error metrics
- Verify Operations - Error Counts: Absolute verify error metrics
- Read Operations - Relative (per TB): Read error rates
- Write Operations - Relative (per TB): Write error rates
- Verify Operations - Relative (per TB): Verify error rates
Check that:
- smartctl is installed on the target host
- Drives support SCSI error counter logs
- Agent plugin is executable and in the correct location
- Plugin produces output:
  /usr/lib/check_mk_agent/plugins/oposs_smart_error
- "Failed to get SMART data": Check drive accessibility and smartctl permissions
- "No error counter data available": Drive doesn't support SCSI error logging (common with consumer SSDs)
- "Agent plugin source file not found": Bakery can't find the agent plugin file
Test the agent plugin directly on target hosts:
# Run the plugin manually
/usr/lib/check_mk_agent/plugins/oposs_smart_error
# Check for output
check_mk_agent | grep -A 10 "<<<oposs_smart_error>>>"
# Test the plugin before deployment
python3 local/share/check_mk/agents/plugins/oposs_smart_error
# Check for plugin errors
tail -f ~/var/log/web.log
tail -f ~/var/log/cmc.log
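The grep check above can also be scripted, for example when verifying many hosts. The following sketch pulls the lines of one section out of captured agent output. The section name comes from the commands above; the payload lines in `SAMPLE` are placeholders, not real plugin output, and `extract_section` is a hypothetical helper.

```python
def extract_section(agent_output: str, name: str = "oposs_smart_error") -> list[str]:
    """Return the lines belonging to the agent section called `name`."""
    lines = []
    inside = False
    for line in agent_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("<<<") and stripped.endswith(">>>"):
            # Section headers may carry options, e.g. <<<name:sep(124)>>>,
            # so only the part before the first colon is compared.
            header = stripped[3:-3].split(":", 1)[0]
            inside = header == name
            continue
        if inside:
            lines.append(line)
    return lines


# Placeholder agent output; the payload lines are made up for illustration.
SAMPLE = """<<<check_mk>>>
AgentOS: linux
<<<oposs_smart_error>>>
placeholder line 1
placeholder line 2
<<<df>>>
placeholder df line
"""

if __name__ == "__main__":
    section = extract_section(SAMPLE)
    if section:
        print(f"section found with {len(section)} line(s)")
    else:
        print("section missing: check plugin location and permissions")
```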
The plugin works with storage devices that support SCSI error counter logs:
- Most enterprise SATA/SAS hard drives
- Many enterprise SSDs
- Some consumer drives (varies by manufacturer)
Devices that typically don't work:
- Consumer SSDs without SCSI error logging
- USB-connected drives
- Some NVMe drives (depending on SMART implementation)
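To judge whether a drive falls into the supported group, it helps to know what a SCSI error counter log looks like. The sketch below parses the table that `smartctl -l error` typically prints for SCSI devices. The sample numbers are made up, the field names are hypothetical, and the plugin itself may collect the data differently (this is not its actual parser).

```python
import re

# Sample in the layout smartctl typically prints for SCSI error counter logs.
# All numbers are made up for illustration.
SAMPLE = """\
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:        125       31         0       156        156      12300.000           0
write:         0        0         0         0          0       4100.500           0
verify:        0        0         0         0          0          0.000           0
"""

# Hypothetical field names, in column order.
FIELDS = (
    "ecc_fast",
    "ecc_delayed",
    "rereads_rewrites",
    "total_corrected",
    "algorithm_invocations",
    "gigabytes_processed",
    "total_uncorrected",
)

ROW = re.compile(r"^(read|write|verify):\s+(.*)$")


def parse_error_counters(text: str) -> dict:
    """Return {operation: {field: value}} for each read/write/verify row."""
    result = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue
        values = match.group(2).split()
        if len(values) != len(FIELDS):
            continue  # skip rows that do not have the expected column count
        result[match.group(1)] = {
            field: float(value) if field == "gigabytes_processed" else int(value)
            for field, value in zip(FIELDS, values)
        }
    return result


if __name__ == "__main__":
    counters = parse_error_counters(SAMPLE)
    if not counters:
        print("No error counter data available")
    for operation, fields in counters.items():
        print(operation, fields["total_corrected"], fields["total_uncorrected"])
```

An empty result from a parser like this corresponds to the "No error counter data available" case described in the troubleshooting notes.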
If no custom thresholds are configured:
- Corrected errors: Warning at any corrected errors
- Uncorrected errors: Critical at any uncorrected errors
This project is licensed under the MIT License - see the LICENSE file for details.
Tobi Oetiker tobi@oetiker.ch
- Issues: GitHub Issues
- Documentation: See plugin checkman documentation after installation
- Releases: GitHub Releases
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request