Monitoring Tools for LLVM Development
=====================================

These tools are not meant to be used for development or testing, but to be
left running on a server or desktop to monitor your buildbots. They are also
meant to be used in conjunction with, not as a replacement for, Nagios and
other hardware-level monitoring tools.

Currently we only have one: bot-monitor, which I keep running on Linaro's
public server (people.linaro.org) and have bookmarked to quickly check the
bot status. It's also a helpful bookmark for all the bots we care about.

JSON Documentation
------------------

The JSON file should be self-explanatory, but just in case, here are a few
of the behaviours it exhibits when rendered by the current version of the
bot-monitor.

The base structure is a list of masters, each of which has a few properties and
a list of builder groups, which in turn have some properties and a list of bots.
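
To make the nesting concrete, here is a minimal sketch of a configuration with
one master, one builder group and two bots, embedded in a Python string and
parsed with the standard `json` module. It uses the property names listed in
the rest of this section. The server, port, URL fragments and bot names are
placeholders, writing "ignore" as a JSON boolean is an assumption, and the way
`bot_url` joins the fields is a guess at how they combine, not a description of
what bot-monitor actually does.

```python
import json

# Hypothetical configuration; every value below is a placeholder.
EXAMPLE_CONFIG = """
[
  {
    "name": "Example Master",
    "base_url": "http://example.org:8011",
    "builder_url": "builders",
    "build_url": "builds",
    "ignore": false,
    "builders": [
      {
        "name": "fast bots",
        "ignore": false,
        "bots": [
          { "name": "example-bot-a", "ignore": false },
          { "name": "example-bot-b", "ignore": true }
        ]
      }
    ]
  }
]
"""


def bot_url(master, bot):
    """Join the master's URL parts with the bot name (assumed layout)."""
    return "/".join([master["base_url"].rstrip("/"),
                     master["builder_url"].strip("/"),
                     bot["name"]])


masters = json.loads(EXAMPLE_CONFIG)
for master in masters:
    for group in master["builders"]:
        for bot in group["bots"]:
            print(group["name"], bot_url(master, bot))
```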

Master properties:

    "name": "Name of the master, which will appear in big bold letters",
    "base_url": "http://SERVER:PORT/BASE",
    "builder_url": "part of the URL that refers to the list of builders",
    "build_url": "part of the URL that refers to the list of builds",
    "ignore": "true | false, hides or shows the entire master on the page",
    "builders": [ ... ]

Builder properties:

    "name": "Name of this group (fast bots, self-hosting, etc)",
    "ignore": "true | false, hides or shows the entire builder group on the page",
    "bots": [ ... ]

Bot properties:

    "name": "Exact name of the buildbot (becomes part of the URL)",
    "ignore": "true | false, whether to ignore failures in this bot"

Note that "ignore" has two different behaviours:

 * On masters and builders, it omits the entire class from the output
 * On bots, it still shows them, but ignores their status
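
The two behaviours can be sketched in Python as follows. This is an
illustration only: the function names, the `failing` set and the sample data
are hypothetical, and "ignore" is assumed to be stored as a boolean. Hidden
masters and builder groups are skipped entirely, while an ignored bot is still
visited but its failure does not count.

```python
def visible_groups(masters):
    """Yield (master, group) pairs that are not hidden by "ignore"."""
    for master in masters:
        if master.get("ignore", False):
            continue  # hidden master: nothing below it is rendered
        for group in master.get("builders", []):
            if group.get("ignore", False):
                continue  # hidden builder group
            yield master, group


def page_is_broken(masters, failing):
    """Return True if any shown, non-ignored bot is in `failing`.

    `failing` is a hypothetical set of (master name, bot name) pairs.
    """
    for master, group in visible_groups(masters):
        for bot in group.get("bots", []):
            if bot.get("ignore", False):
                continue  # still displayed, but its status is not counted
            if (master["name"], bot["name"]) in failing:
                return True
    return False


masters = [
    {"name": "M", "ignore": False, "builders": [
        {"name": "fast", "ignore": False, "bots": [
            {"name": "a", "ignore": False},
            {"name": "b", "ignore": True},
        ]},
        {"name": "hidden", "ignore": True, "bots": [
            {"name": "c", "ignore": False},
        ]},
    ]},
]

# Only an ignored bot and a bot in a hidden group are failing here.
print(page_is_broken(masters, {("M", "b"), ("M", "c")}))
# A shown, non-ignored bot is failing here.
print(page_is_broken(masters, {("M", "a")}))
```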

Note on bots:

  * You can repeat bots across builders, if they belong to multiple classes, for
    example "self-hosting" and "test-suite". The script will cache the results
    and simply re-print them, so this is *only* for visualisation / organisation
    purposes.
  * Using the same bot name on different masters means *different* bots. It may
    be the same configuration on two different masters, or it may be completely
    different bots. Beware.
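
One way to picture both notes is a cache keyed by the (master, bot) pair. The
sketch below is an assumption about how such a cache could look, not the
script's actual code; `make_cached_fetcher` and `fake_fetch` are hypothetical
names, and the fetch function stands in for a real status request.

```python
def make_cached_fetcher(fetch):
    """Wrap `fetch(master_name, bot_name)` so each pair is fetched once."""
    cache = {}

    def cached(master_name, bot_name):
        # The master is part of the key, so the same bot name on another
        # master is treated as a different bot.
        key = (master_name, bot_name)
        if key not in cache:
            cache[key] = fetch(master_name, bot_name)
        return cache[key]

    return cached


calls = []


def fake_fetch(master_name, bot_name):
    """Stand-in for a real status request; records each call."""
    calls.append((master_name, bot_name))
    return "PASS"


get_status = make_cached_fetcher(fake_fetch)
get_status("master-1", "bot-x")  # fetched
get_status("master-1", "bot-x")  # repeated in another group: served from cache
get_status("master-2", "bot-x")  # same name, different master: fetched again
print(len(calls))
```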


HTML Page
---------

For now, there's only HTML output, but there's nothing stopping us from
developing more forms of communication (email, IRC bots, etc).

The HTML page is separated into blocks: Masters, Builder Groups, Bots. It also
has a date at the top, so you can make sure you're looking at an up-to-date
page, and it changes the page icon from green to red if at least one
(non-ignored) bot is broken.

Offline bots are considered broken, as they may require attention. However,
when the admin restarts the master, all buildslaves are killed, and this shows
up as "slave lost". You don't need to do anything in that case, just wait for
the next successful build.

Each buildbot has four columns:

 * Name & link: The bot name with a link to its page on its master. Good for
   easy access to buildbots and masters.
 * Status: Can only be "PASS" or "FAIL", but contains additional information
   if it fails, e.g. "slave lost", "build stage 1" or "test-suite". These are
   the names of the stages that failed.
 * Build number: The build number, to help identify if there is a change from
   a specific number. Not very useful, but there just for reference.
 * Commit range: The range of commits that were tested on that build. This is
   very helpful to identify if a slow bot is failing because it hasn't yet
   reached the commit range on a fast bot that is passing, or not.
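
The commit-range comparison can be sketched as below. It assumes each range is
a (first, last) pair of integer revision numbers, which is a simplification
for illustration; the function name is hypothetical and the page itself only
displays the ranges, leaving the comparison to the reader.

```python
def slow_bot_is_behind(slow_range, fast_range):
    """Return True if the slow bot has not yet tested the fast bot's commits.

    Each range is a hypothetical (first, last) pair of integer revisions.
    """
    slow_last = slow_range[1]
    fast_first = fast_range[0]
    return slow_last < fast_first


# The slow bot stopped at 110, the passing fast bot started at 115: the slow
# bot may simply not have reached the fix yet.
print(slow_bot_is_behind((100, 110), (115, 120)))
# The ranges overlap, so the slow bot has already seen some of those commits.
print(slow_bot_is_behind((100, 118), (115, 120)))
```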


LLVM Masters
------------

There are a number of masters in the LLVM upstream infrastructure, and we may
need to monitor bots in all of those, or switch between them, depending on the
need.

* LLVM Upstream main master: http://lab.llvm.org:8011/

This is the main master that spams everyone every time one of the bots breaks.
Unless there is a specific concern, bots should be on this master.

* LLVM Upstream silent master: http://lab.llvm.org:8014/

Exactly the same as above, but no emails are sent. This master is usually empty
except for bots that are temporarily noisy, in active development, or that
track performance regressions rather than compiler regressions. The latter are
monitored on another page (http://llvm.org/perf/).

* LLVM Japan master: http://bb.pgr.jp/

A side master built by Nakamura Takumi with some x86 and x86_64 buildbots. We
rarely need to monitor anything there, but it's good to know it's there.

* Linaro Downstream master: http://buildmaster.tcwglab.linaro.org/

Our local master, which we use for development. Individual developers can have
their own containers, in which case the masters will be on different ports.

These bots should always be ignored for their global status, or we'll generate
a lot of noise for ourselves. Unless, of course, they're on their way upstream
and going through staging deployment.

* Green Dragon bots: http://lab.llvm.org:8080/green/

This is not a buildbot master, but Jenkins. We don't monitor those on our page,
but they do have IRC bots in the #llvm channel and are already quite good at
displaying successes and failures.