Monitoring Tools for LLVM Development
=====================================

These tools are not meant to be used for development or testing, but to be
left running on a server or desktop to monitor your buildbots. They are also
meant to be used in conjunction with, not as a replacement for, Nagios and
other hardware-level monitoring tools.

Currently we only have one: bot-monitor, which I keep running on Linaro's
public server (people.linaro.org) and keep bookmarked to quickly check the
bot status. It's also a helpful bookmark for all the bots we care about.

JSON Documentation
------------------

The JSON file should be self-explanatory, but just in case, here are a few of
the behaviours it exhibits when rendered by the current version of the
bot-monitor.

The base structure is a list of masters, each of which has a few properties
and a list of builder groups, which in turn have some properties and a list
of bots.

Master properties:

    "name": "Name of the master, which will appear in big bold letters",
    "base_url": "http://SERVER:PORT/BASE",
    "builder_url": "part of the URL that refers to the list of builders",
    "build_url": "part of the URL that refers to the list of builds",
    "ignore": "true | false, hides or shows the entire master on the page",
    "builders": [ ... ]

Builder properties:

    "name": "Name of this group (fast bots, self-hosting, etc)",
    "ignore": "true | false, hides or shows the entire builder on the page",
    "bots": [ ... ]

Bot properties:

    "name": "Exact name of the buildbot (becomes part of the URL)",
    "ignore": "true | false, whether to ignore failures in this bot"

Note that "ignore" has two different behaviours:

* On masters and builders, it omits the entire class from the output
* On bots, it still shows them, but ignores their status

Notes on bots:

* You can repeat bots across builders, if they belong to multiple classes,
  for example "self-hosting" and "test-suite".
  The script caches the results and simply re-prints them, so this is *only*
  for visualisation / organisation purposes.
* Using the same bot name on different masters means *different* bots. It may
  be the same configuration on two different masters, or it may be completely
  different bots. Beware.

HTML Page
---------

For now, there's only HTML output, but there's nothing stopping us from
developing more forms of communication (email, IRC bots, etc).

The HTML page is separated into blocks: Masters, Builder Groups, Bots. It
also has a date at the top, so you can make sure you're looking at an
up-to-date page, and it changes the page icon from green to red if at least
one (non-ignored) bot is broken.

Offline bots are considered broken, as they may require attention. However,
when the admin restarts the master, all buildslaves are killed and this shows
up as "slave lost". In that case you don't need to do anything; just wait for
the next successful build.

Each buildbot has four columns:

* Name & link: The bot name with a link to its page on its master. Good for
  easy access to buildbots and masters.
* Status: Can only be "PASS" or "FAIL", but contains additional information
  if it fails, e.g. "slave lost", "build stage 1" or "test-suite". These are
  the names of the stages that failed.
* Build number: The build number, to help identify whether there has been a
  change since a specific build. Not very useful, but there for reference.
* Commit range: The range of commits that were tested on that build. This is
  very helpful for telling whether a slow bot is failing only because it
  hasn't yet reached the commit range at which a fast bot is already passing.

LLVM Masters
------------

There are a number of masters in the LLVM upstream infrastructure, and we may
need to monitor bots on all of them, or switch between them, depending on the
need.

* LLVM Upstream main master: http://lab.llvm.org:8011/
  This is the main master that spams everyone every time one of the bots
  breaks.
  Unless there is a specific concern, bots should be on this master.

* LLVM Upstream silent master: http://lab.llvm.org:8014/
  Exactly the same as above, but no emails are sent. This master is usually
  empty except for bots that are temporarily noisy, in active development, or
  that track performance regressions rather than compiler regressions. The
  latter are monitored on another page (http://llvm.org/perf/).

* LLVM Japan master: http://bb.pgr.jp/
  A side master built by Nakamura Takumi with some x86 and x86_64 buildbots.
  We rarely need to monitor anything there, but it's good to know it's there.

* Linaro Downstream master: http://buildmaster.tcwglab.linaro.org/
  Our local master, which we use for development. Individual developers can
  have their own containers, in which case the masters will be on different
  ports. These bots should always be ignored for their global status, or
  we'll generate a lot of noise for ourselves. Unless, of course, they're on
  their way upstream and going through staging deployment.

* Green Dragon bots: http://lab.llvm.org:8080/green/
  This is not a buildbot master, but Jenkins. We don't monitor those on our
  page, but they do have IRC bots in the #llvm channel and are already quite
  good at displaying successes and failures.
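
As a rough illustration of the "ignore" rules from the JSON Documentation
section and the green/red icon rule from the HTML Page section, here is a
minimal Python sketch. It is not the bot-monitor's actual code. The property
names come from this README, but everything else is an assumption made for
the example: the master, group and bot names are made up, "ignore" is assumed
to be a JSON boolean, and the helpers `visible_bots` and `page_is_red` are
hypothetical.

```python
import json

# Hypothetical sample configuration: a list of masters, each with builder
# groups, each with bots. All names and the URL are placeholders.
CONFIG = json.loads("""
[
  {
    "name": "Example Master",
    "base_url": "http://SERVER:PORT/BASE",
    "builder_url": "builders",
    "build_url": "builds",
    "ignore": false,
    "builders": [
      {
        "name": "Fast bots",
        "ignore": false,
        "bots": [
          {"name": "example-fast-bot", "ignore": false},
          {"name": "example-noisy-bot", "ignore": true}
        ]
      },
      {
        "name": "Hidden group",
        "ignore": true,
        "bots": [
          {"name": "example-hidden-bot", "ignore": false}
        ]
      }
    ]
  }
]
""")


def visible_bots(masters):
    """Yield (master name, group name, bot) for everything shown on the page.

    Ignored masters and ignored builder groups are omitted entirely;
    ignored bots are still shown.
    """
    for master in masters:
        if master.get("ignore", False):
            continue
        for group in master["builders"]:
            if group.get("ignore", False):
                continue
            for bot in group["bots"]:
                yield master["name"], group["name"], bot


def page_is_red(masters, failing):
    """Return True if at least one shown, non-ignored bot is failing.

    `failing` is a set of (master name, bot name) pairs. Bots are keyed by
    master as well as name, because the same bot name on different masters
    means different bots.
    """
    return any(
        (master, bot["name"]) in failing and not bot.get("ignore", False)
        for master, _group, bot in visible_bots(masters)
    )


if __name__ == "__main__":
    for master, group, bot in visible_bots(CONFIG):
        print(master, "/", group, "/", bot["name"])
    # A failing bot marked "ignore" is shown but does not turn the page red.
    print(page_is_red(CONFIG, {("Example Master", "example-noisy-bot")}))  # False
    print(page_is_red(CONFIG, {("Example Master", "example-fast-bot")}))   # True
```

In this sketch the "Hidden group" never reaches the page at all, while
"example-noisy-bot" is listed but cannot change the overall status, which
mirrors the two behaviours of "ignore" described above.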