All versions of ESX and ESXi, including the free-licensed option, include good hardware health monitoring capabilities right out of the box, which are extended even further by the vendor-specific builds from IBM, HP and Dell for supported hardware (edit: the script has also been tested against ESXi v5). Having this visibility is key to operating a stable and reliable infrastructure, but it is of little use without some automatic alerting.
For vCenter-managed hosts this is of course no problem, thanks to its extensive range of alerting capabilities. But for standalone hosts, there isn’t even a basic email-on-alert function.
Fortunately, though, the health status interfaces are easily accessible using a Perl script, and William Lam has developed just such a script, available via his Ghetto Blog, here.
esx-health.pl
I’ve extended and customised this script a little by adding CPU and RAM usage, some colour coding, overall status in the email header and by tweaking the layout a little to my needs. My version is available in the wiki – code here and documentation here.
The script needs VMware’s vSphere CLI installed, and is run from a command line which looks like this:
Creating a Daily Status Email
The key requirement for me is to have the collection of health data automated. I run a ‘management-server’ VM, a Windows guest, which deals with keeping everything running smoothly (for example, UPS shutdown, WiFi access point scheduled restarts, Internet connection failure detection and restarting, backing up Linux guests via a Windows NFS share, and so on), so I have simply added a Windows scheduled task to run the health check script every morning and email me a report.
Worth the Effort?
Server reliability, especially disk reliability, seems to have got considerably better in the last few years. However, my home lab box (which is serving these pages) did recently have a disk fail, and this script alerted me to the fact within hours. As this server has internal drives and the Perc-5i controller, which lacks any audible failure warning, without the script this could have gone unnoticed (perhaps for months). Unfortunately, being an ML115, it doesn’t report any other hardware status such as fan, PSU and RAM errors, but still, it potentially saved me downtime, so it was definitely worth the effort to me.
In Summary
ESX and ESXi have superb hardware health monitoring capabilities built right in, but require vCenter to provide status alerting. By scheduling a modified version of William Lam’s hardware monitoring script, this limitation has been avoided, and in this case a disk failure was acted upon within hours.
Because of the hardware monitoring (with alerting via this script) and the simple network redundancy that is transparent to guests, I’m finding more and more use for ESXi. Even for a single-purpose box, I’d rather use ESXi as the base OS and run the OS in a VM. I’m working on a blog post detailing how to build a completely free, high-performance shared storage solution for ESX based on a Debian VM on ESXi (which will also cover a completely free ESXi backup solution).


This is really great content. I enjoy reading your Blog. Please keep posting.
Again an excellent article James! I’ve downloaded the script and placed it on my management VM and it runs like a dream..!
Many thanks..!
Hi Aleks, Glad you found it useful – and thanks for leaving a comment.
this script is handy, but I found the numeric values for ‘other system components’ like fan speed and temperature values were out by a factor of -200! eg. a temperature was reporting as -8200 Degrees C instead of an expected 41 degrees C as shown for the same value when viewed in vSphere Client. I think line 237 :
my $reading = $_->currentReading * $_->unitModifier;
should be:
my $reading = $_->currentReading * 10 ** $_->unitModifier;
unitModifier seemed to be -2 for me. It dawned on me that it was there to move the decimal point in the currentReading.
Cheers.
Hi Murray, I tried your code mod on a few Dell servers and it returns the correct values in all cases (my test ML115 doesn’t report these sensors hence why I’d missed the error). So many thanks indeed for posting this, I have updated the code in the wiki. Cheers
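Murray’s correction can be illustrated with a short sketch. It is written in Python rather than the script’s Perl, purely as an illustration; the function names are hypothetical, and the raw value of 4100 is inferred from the figures quoted above (a reading shown as -8200 that should have been 41, with a unitModifier of -2).

```python
def buggy_reading(current_reading: int, unit_modifier: int) -> int:
    """Original behaviour: the exponent is used as a plain multiplier."""
    return current_reading * unit_modifier


def scaled_reading(current_reading: int, unit_modifier: int) -> float:
    """Corrected behaviour: unit_modifier is a base-10 exponent."""
    return current_reading * 10.0 ** unit_modifier


# Illustrative values inferred from the comment above.
raw, modifier = 4100, -2
print(buggy_reading(raw, modifier))   # the plain product, -8200
print(scaled_reading(raw, modifier))  # approximately 41 degrees C
```

In other words, the modifier shifts the decimal point in the raw reading; it is not a factor in its own right.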
James,
Is there an easy way to edit the script to tell it to not worry about memory usage up to % xx..? (say 92 or 95%?)
My HP ML 110 G5 can only hold 8 GB so my memory usage is always around 86%/90% which makes the script always give an alert every day. I don’t want to get used to a warning/error each day because it might make me blind to the fact that something might really be wrong (cpu usage or disk)
I tried changing the “green” into “red” but that doesn’t seem to help..
TIA!
Hi Aleks, actually I’ve been having this issue too so it’s quite a timely request. I’ve added command-line options --memwarnpc and --cpuwarnpc to the code. Does this help? By the way, 512MB RAM can be freed up by reducing the reservation on host/vim/vmvisor to 256MB (system resource allocation, advanced view). I’ve also read that the ML110 (not the 115) can take 16GB, if you can stretch to it!
Hi again James,
I did notice that you put in my suggested mod for the correct use of the unitModifier variable, but since your latest mod for memwarn and cpuwarn, you have reverted back to how it was before, WRT unitModifier. Was there a problem?
Ta,
Murray
Hi Murray, sorry about that, I worked from an out-of-date version in my hurry to complete some coding before Top Gear! Now sorted, thanks very much for checking. Cheers, James.
Hi James,
B-R-I-L-L-I-A-N-T! It works…!!!!
Please James, can you tell me where you read that the ML110 can support 16 GB..?
The memory reclaim of 512mb sounds like a good idea
http://www.vm-help.com/esx40i/memory_allocation.php
I first want to upgrade my production box to 4.1 before I change this.
@murray
I entered your mod into the updated script and it works fine so I think James will update it shortly..!
Hi Aleks, glad it’s working for you. My bad with the RAM – the ML110G6 seemingly works with 4GB DIMMs (unsupported) – see the comment by Ron here – but the 3200 chipset in the G5 is 8GB addressable only, sorry about that.
No worries mate..! Your 512 MB tip will probably help me to reduce memory usage just enough to a point where I can survive the coming 2-3 years I hope
Thanks again for modding your health script..!
[...] Obviously reliability is a concern. For better spec hardware that can run ESXi (as in my DR site scenario), using ESXi as the base OS provides excellent health (and performance) monitoring out-the-box, which can be periodically checked using a reporting script such as esx-health.pl. [...]
Hi,
The script works great on our PowerEdge 2800 servers! However on our T300 servers ESXi health status reports ‘Power – Power Supply 1 and Power Supply 2’ as Unknown.
This creates 2 alerts every time.
o Power Supply 1: 0 Watts
o Power Supply 2: 0 Watts
Would it be possible to have an option to skip this alert?
Hi Tim, thanks for posting with this idea. I’ve added an optional parameter "--exclude", after which you can add anything you like and it will check each line, ignoring any that contains the specified text. So in your example you should be able to write:
--exclude "Power Supply"
to suppress reporting (and alerting) against the two lines.
Could you give it a go and post back?
Cheers, James.
Hi James,
Sorry for the delayed response. I’ve just tried the script and it skipped the power supply alerts.
Thanks!
hello, is it possible to exclude multiple lines? I’ve got a couple of systematic errors, fan and chassis alert intrusion, and I’d like to shut them both off, can you help me? btw awesome script!
Hi, yes you can include a Perl regular expression as that parameter. So separate each part with a pipe character ("|") meaning "OR", for example:
--exclude "fan|intrusion"
Hope that helps! NB an updated version of the script should be up in a couple of days with a few bug fixes.
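As a rough illustration of how an exclude pattern of this kind behaves, here is a minimal Python sketch (the script itself is Perl, and this is not its actual code). The function name and most of the sample lines are hypothetical, and matching without regard to case is an assumption.

```python
import re


def filter_lines(lines, exclude=None):
    """Drop every report line that matches the exclude pattern."""
    if not exclude:
        return list(lines)
    # Case-insensitive matching is an assumption, not confirmed behaviour.
    pattern = re.compile(exclude, re.IGNORECASE)
    return [line for line in lines if not pattern.search(line)]


# Sample report lines; only the power supply lines come from the thread.
sample = [
    "Power Supply 1: 0 Watts",
    "Power Supply 2: 0 Watts",
    "Fan 1: unknown",
    "Chassis Intrusion: alert",
    "Disk 0: failed",
]

print(filter_lines(sample, "fan|intrusion"))   # keeps power supplies and the disk
print(filter_lines(sample, "Power Supply 2"))  # drops only the second supply
```

Because the pattern is matched anywhere in the line, a broad term such as "Power Supply" removes every power supply line, while a more specific one removes a single sensor.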
will try asap, thanks!!!
Wow,
thanks man this is such a good script,
Albert
Hello,
Thanks for the GREAT script, works like a charm! Very useful!
A question: I would like to crontab this (Linux for Task Scheduler) to run every 15 minutes, and IF there’s an alert send me a mail. It should not continue to send mail every 15 minute because there STILL is an alert, but instead only send mail again if the alert count has gone to zero and back up to above zero.
I guess this is something more users would like to see, as we then could receive the health reports weekly/monthly and have the script mail me directly if it finds an alert.
Is it doable?
Thanks in advance!
Hi Jonas, thanks for leaving this idea. I’ve added in an option "--warnonchange" which I think does what you are looking for. Be aware that it only tracks the total number of alerts though. Please let me know if that is what you meant! Cheers, James.
Thanks for the quick reply and the new functionality! I have scheduled one script with the --warnonchange switch to run every 20 minutes, and one without the switch to run weekly.
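The idea of alerting only when the alert total changes can be sketched as follows. This is an illustrative Python version, not the script’s Perl code; the state file name, the function name and the choice to treat the very first run as a change are all assumptions.

```python
import os
import tempfile


def alert_count_changed(state_path, current_count):
    """Return True when the alert total differs from the last recorded run."""
    previous = None
    if os.path.exists(state_path):
        with open(state_path) as handle:
            text = handle.read().strip()
            previous = int(text) if text.isdigit() else None
    # Record the current total for the next scheduled run.
    with open(state_path, "w") as handle:
        handle.write(str(current_count))
    # A missing or unreadable state file counts as a change (assumption).
    return previous != current_count


# Hypothetical state file in a temporary directory.
state_file = os.path.join(tempfile.mkdtemp(), "last-alert-count.txt")
for count in [0, 1, 1, 0]:
    print(count, alert_count_changed(state_file, count))
```

Since only the total is compared, one alert clearing while another appears between runs would go unreported, which is the limitation James mentions.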
It seems to work, but I’ll try to generate some alerts next time I get to the server to make sure.
I made a small change to the sendmail function to allow it to send mail to several recipients:
@tos = split(‘,’, $EMAIL_TO);
And then, to add all mail addresses, I did this:
foreach () { smtp->to($_); }
Maybe you could add that to the script for others to use?
Thanks for all help!
I tested generating an alert, and it works just fine. I received an alert e-mail directly, and another one when the alert count went back to zero.
Great work! Thanks!
Hi Jonas, I put this into the latest version, many thanks for this idea
Just noticed that in my foreach loop the parentheses became empty after posting (because of the arrows in it). It should read:
foreach (<@tos>) { smtp->to($_); }
But I guess you could figure that out
Hi Jonas, many thanks for posting that – looks like a good idea to me.
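For readers who want to see the recipient-splitting idea on its own, here is a minimal Python sketch (the script is Perl; this is only an illustration). The function name and the example.com addresses are hypothetical.

```python
def parse_recipients(email_to):
    """Split a comma-separated address string into a clean list."""
    return [addr.strip() for addr in email_to.split(",") if addr.strip()]


# Hypothetical addresses for illustration only.
recipients = parse_recipients("admin@example.com, oncall@example.com")
print(recipients)

# With Python's smtplib the whole list can be passed in one call, e.g.
# smtplib.SMTP(host).sendmail(sender, recipients, message)
```

Stripping whitespace and skipping empty entries means a trailing comma or a space after a comma does not produce a bad address.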
Hi. Firstly, great script, many thanks. Just after ideas as to why this command generates an email every time it runs when it should only generate one on a change. Every time it runs I get an email saying server status is normal and healthy.
Cheers
c:\VMwareCLI\bin\esx-health.pl --server x.x.x.x --username xxxx --password xxxx --mailhost mail --maildomain xxx.xxx.xx --mailfrom [removed] --mailto [removed] --cpuwarnpc 75 --memwarnpc 85 --warnonchange
Never mind, we found the answer.
Cheers
Great Script !
It would be very useful to also fire an alarm when datastore free space is below a specific threshold…
(could also be % used space above a certain value if easier)
Hi, many thanks for leaving feedback. This I think is an excellent idea, I’ll see what I can do! Cheers, James.
Hi Michel. Firstly, I’m sorry for the delay in replying. But I have added DS space monitoring, which can be customised using --dswarnpc and --dscriticalpc, plus a couple of other bits; check out the new version. Hope that helps
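A two-level threshold check of the kind described here might look like the sketch below. It is Python rather than the script’s Perl, and it rests on assumptions: the thresholds are taken to be percentages of space used (the thread does not say whether used or free space is measured), and the default values of 80 and 90 are hypothetical.

```python
def datastore_status(capacity_gb, free_gb, warn_pc=80, critical_pc=90):
    """Classify a datastore by the percentage of its space in use."""
    used_pc = 100.0 * (capacity_gb - free_gb) / capacity_gb
    if used_pc >= critical_pc:
        return "critical"
    if used_pc >= warn_pc:
        return "warning"
    return "normal"


# Hypothetical 500 GB datastore at three fill levels.
for free in (200, 75, 25):
    print(free, datastore_status(500, free))
```

Checking the critical threshold first ensures a nearly full datastore is never reported as a mere warning.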
Thanks James for your update, just tested the latest build and datastore free space alerting works just fine
thanks again for your time, keep up the good work !
Hi, james,
can your script send out an alert if the RAID controller, RAID array or a hard disk on VMware Server has a problem? Thanks!
Jan
Hi Jan. The script only runs against ESXi 4 and 4.1 I’m afraid. For VMware Server you’d need monitoring components for whatever its underlying OS is, for example Dell’s OpenManage or Compaq’s Insight Manager. Thanks for the interest!
@DJE,
What was your problem? How did you solve it?
Thanks for this work! I did not know this was possible.
However, and I paraphrase with great respect: “RAID1 health monitor, RAID1 health monitoring, my kingdom (such as it is) for RAID1 health monitoring under ESXI 4.1.”
Ken
[...] for DR and archive purposes, since the performance charting with datastore latency numbers and built-in health monitoring are extremely [...]
Thank you very much for this script. It has made my life a lot easier. I do have a very small issue: I am receiving the email alerts on my Blackberry, but since the size of the message is over 32k the message gets truncated, which then excludes some important information at the end of the message. If I am not mistaken it truncates right before POWER COMPONENTS. So my question is: can I change the order in which the report is displayed? I would like to have storage at the top, or at least before the software portion, as it is more important. Thanks again
Hi, thanks for posting this. It’s a great point – I’ll try and move things about or put a summary at the top perhaps. Cheers, James.
Hi Neill, I’ve added a new option "--concise" that produces summary output. New code is in the wiki. Hope that helps!
It works fine for me, and thanks!
[...] Naturally we feel the need to automate the check, so while looking for a simple way to do it I found a Perl script that works really well. The file is called esx-health.pl and can be found at this address: http://blog.peacon.co.uk/hardware-health-alerting-with-esxi/. [...]
Seriously one of the best things for ESXi.
Many thanks, you saved me on some nerves.
Hi,
I’m trying to modify the code to allow to authenticate against an SMTP server. Unfortunately, this seems a bit beyond me, is there anyone that can point me in the right direction? I really love this script but having to use an anonymous SMTP connection isn’t going to work out for me.
Thanks a lot,
Mike
It is working great for me also.
This script works perfectly! But we have a Fujitsu TX200S6 server with 2 power supplies, and it detects Power Supply 0:0 + 1:0 + 2:0, and therefore gives me an error on Power Supply 2:0. How do I exclude Power Supply 2:0? I have tried the --exclude parameter but it seems to exclude all power supplies.
Hi, try --exclude "Power Supply 2".
That works! Thanks!
Hello there, I’m just trying out the script and it works fine! Just one question:
If ESXi is configured with Openfiler as storage, is there a chance to get an email when ESXi loses the connection to OF? I tried the “warnonchange” param and executed the script, then pulled the Ethernet cable from my Openfiler. Sadly I didn’t get an email!
Hello, the script will generate an alert if a datastore is listed as inaccessible, but it’s not memory resident so needs to be run periodically. I.e. set a scheduled task to run the script at the frequency required, say every hour, and use --warnonchange so that it sends an email only if the number of alerts has changed since the last run. Hope that helps!
Awesome script. Is there a command line switch to specify the html name? I cannot work it out.
Also my ESXi servers are all 4.0 but 3 out of 9 don’t report any value for the current memory usage which is odd, any ideas?
Hi, thanks for the feedback. Can you clarify ‘html name’?
Hi, sorry by html name I mean esx-host-health-report.html – I was hoping to include the ESXi server name in the report so that when I automate the running I can have all of the outputted reports named accordingly. I was also wondering whether you could have the date added too?
Any ideas about why the memory won’t report on 3 out of 9 servers?
Thanks
Could you post a link to a graphic of one of the reports not working? The default code does include the FQDN (even if connecting by IP address directly), but that needs to be configured on the host of course.
This script is a much better fit for me than the larger one that Lam wrote. My environment just isn’t big enough to warrant those options.
Anyway, I’m running it, and having a problem where it won’t report on performance stats. Nothing comes up under CPU and RAM usage. I took a screen snip here of the emailed report.
http://i.imgur.com/jFXx1.jpg
Everything else in it seems to collect and report just fine. Any ideas?
I’m running ESXi 4.1 on an individual host, and VCLI 4.1 on my reporting machine.
@Brian – we’re having the same problem where the current usage fields are blank unless there is an alarm. Also the datastore alert does not work even though we’ve tried a variety of thresholds.
Thanks for posting this. I’m looking at this – can you confirm the ESX(i) build?
Hi, I’ve fixed the missing usage fields issue – thanks for pointing that out. Revised code in the wiki. Many thanks!
Hi.
We tried your script esxih and it works fine.
We have a request: when the script finds a new alert (with the option -warnonchange=yes), it sends a mail.
I would like subsequent runs that find the same alert still active to resend the mail.
In other words, with the warnonchange option set, while there are active alerts the script should behave as if warnonchange were not active, until the alarm clears.
That way we would receive mail until the alarm ends.
Hi Paolo, this is a good idea. I’ve added --warnonalerts, which modifies the behaviour of --warnonchange, I think exactly as you suggest. Full explanation (and code) in the wiki. Many thanks!
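How the two switches might combine can be summarised in a small decision function. This is a Python sketch of the behaviour as described in this thread, not the script’s Perl code; the function and parameter names are hypothetical, and the wiki remains the authoritative explanation.

```python
def should_send(previous_count, current_count,
                warn_on_change=False, warn_on_alerts=False):
    """Decide whether a run should send its report email."""
    if not warn_on_change:
        return True  # without the switch, every run sends a report
    if current_count != previous_count:
        return True  # the alert total changed since the last run
    # Optionally keep mailing for as long as any alert remains active.
    return warn_on_alerts and current_count > 0


# (previous, current) alert totals for four consecutive situations.
for prev, cur in [(0, 0), (0, 1), (1, 1), (1, 0)]:
    print(prev, cur,
          should_send(prev, cur, warn_on_change=True),
          should_send(prev, cur, warn_on_change=True, warn_on_alerts=True))
```

The only case where the two modes differ is an unchanged, non-zero alert total: change-only mode stays quiet, while the combined mode keeps sending until the alerts clear.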
Hi James,
Is there any stuff we need to know on the health script running against ESXi 5?
Do we need to install vcli v5 ?
I’ve tried running the script with vcli v4 still installed with the following result:
VMware ESXi 4.1.0 build-260247: Host has 1 alert:
· CURRENT USAGE
o CPU: 99%
Which I checked, and it is not the case: the CPU is not running at 99%.
The batch also throws this at me:
print() on closed filehandle LOGFILE at C:\scripts\health\esx-health.pl line 1105.
Generating ESX host health report “esx-host-health-report.html” .
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 1048.
print() on closed filehandle LOGFILE at C:\scripts\health\esx-health.pl line 1105.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 318.
Processing esxhost.domain.local (VMware ESXi 5.0.0 build-469512): 1 alerts.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 481.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 482.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 483.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 484.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 493.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 501.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 502.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 509.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 593.
print() on closed filehandle LOGFILE at C:\scripts\health\esx-health.pl line 1105.
print() on closed filehandle REPORT_OUTPUT at C:\scripts\health\esx-health.pl line 1066.
Sending email to nos...@hotmail.com.
print() on closed filehandle LOGFILE at C:\scripts\health\esx-health.pl line 1105.
As always thanks for this excellent script and work..!
Hi, I’ve not tested it on v5 (and have no intention of doing so, because of the licensing terms). But could you post some info on your test box (CPUs/cores/VMs etc)? Also, the errors seem to suggest that the user account running the script might not have write access to the log file folder (or is the drive full)?
Hi James,
I’ve always had the script run from a scheduled task and decided to let it do its own thing and leave it for a day or so.
Behold this morning I received an email from the script and the output is now what I expected: NORMAL. (with the usual CPU and MEM usage)
(in truth the above output is not from the user that normally runs the script so you might be correct on the rights issue).
Well, what can I offer on details from the script running under v5 ?
- It seems to work for starters
- Of course the version numbers are all mixed up (all drivers and esxi version are still v4 according to the script)
- I found out the hard way that VMware decided for version 5 to yet again go another route with the hardware. My IBM ServeRaid BR10i adapter is now shown as a simple LSI 1068E adapter. The biggest problem with this is that ESXi no longer shows the health status of the RAID adapter, because VMware no longer uses CIM providers but instead leaves it up to the hardware vendor to provide us with a “.vib driver”. So if IBM or LSI decide to skip this card (or any other card), in essence you could end up having less information about your hardware in version 5 compared to version 4 !!!
Oddly the health script still “sees” my adapter as a ServeRaid BR10i and still provides the status of the RAID disk configuration (is that because I’m still using the VCLI version 4..?)
Are you staying with version 4 until VMware decides to change its tune or are you looking at moving to another hypervisor..?
Regards,
Aleks
Thanks James for the script . . keep up the good work.
Regards . . Sherif
I’ve just tried this script against an ESXi 5 host and it works exactly the same as against my 4.1 host. Running from v5 vMA.
Works a treat, awesome work thanks!