author David Spickett <david.spickett@linaro.org> 2022-10-26 11:56:07 +0100
committer David Spickett <david.spickett@linaro.org> 2022-10-27 09:43:38 +0100
commit d90e09ac78ad7db816375ffb356016c6894c1804
tree 8c7be77c8149e9907913fb4acbe965574b3abda8
parent a4f23fe62b045c3e6db25c7245dac01e85d922e4
llvmbot monitor: Update README

This removes a lot of now incorrect information and dead links. The description of the JSON format has been updated also.

Change-Id: I9001baa2e9af5f64ea031e8b4dacf64337f72e8c
-rw-r--r-- monitor/README.txt 195
1 files changed, 91 insertions, 104 deletions
diff --git a/monitor/README.txt b/monitor/README.txt
index 4faef1a..2ca5a4a 100644
--- a/monitor/README.txt
+++ b/monitor/README.txt
@@ -1,128 +1,115 @@
-Monitoring Tools for LLVM Development
-=====================================
+LLVM Buildbot Monitor
+=====================
-These tools are not meant to be used for development or testing, but to be
-left running on a server or desktop as monitoring for your buildbots. They
-are also meant to be used in conjunction, not as a replacement, to Nagios
-and other hardware-level monitoring tools.
+This is to be left running on a server or desktop as monitoring for your buildbots.
+It purely reports the status of the builds. If you want hardware monitoring, look
+elsewhere.
-Currently we only have one: bot-monitor, which I keep running on Linaro's
-public server (people.linaro.org) and keep it as a bookmark to quickly check
-the bot status. It's also a helpful bookmark for all bots we care.
+It supports Buildbot (as used by LLVM) and Buildkite (https://buildkite.com/, used by LLVM's
+libcxx project). It does not support LLVM Green Dragon (https://green.lab.llvm.org/green/).
-JSON Documentation
-------------------
+Buildkite requires an API key to be available; see the script for how to provide one,
+and contact your Buildkite org admin to get one.
-The JSON file should be self-explanatory, but just in case, here's a few
-of the behaviours it exhibits when rendered by the current version of the
-bot-monitor.
+Currently we have one monitor running at http://llvm.validation.linaro.org/.
+Bookmark this if you have a need to check bot status at a glance.
-The base structure is a list of masters, which has a few properties and a list
-of builder groups, which in turn also have some properties and a list of slaves.
-
-Master properties:
-
- "name": "Name of the master, which will appear in bold big letters",
- "base_url": "http://SERVER:PORT/BASE",
- "builder_url": "part of the URL that refers to the list of builders",
- "build_url": "part of the URL that refers to the list of builds",
- "ignore" : "true | false, shows or hide the entire master from the page"
- "builders": [ ... ]
-
-Builder properties:
-
- "name": "Name of this group (fast bots, self-hosting, etc)",
- "ignore" : "true | false, shows or hide the entire builder from the page"
- "bots": [ ... ]
-
-Bots properties:
-
- "name": "Exact name of the buildbot (becomes part of the URL)",
- "ignore": "true | false, to ignore or not failures in this bot"
-
-Note that "ignore" has two different behaviour:
-
- * On masters and builders, it omits the entire class from the output
- * On bots, it still shows them, but ignores their status
+JSON Format
+-----------
-Note on bots:
+The JSON file describes the bots we want to monitor and which master/build service
+they connect to.
- * You can repeat bots across builders, if they belong to multiple classes, for
- example "self-hosting" and "test-suite". The script will cache the results
- and simply re-print them, so this is *only* for visualisation / organisation
- purposes.
- * Using the same bot name on different masters means *different* bots. It may
- be the same configuration on two different masters, or it may be completely
- different bots. Beware.
+Buildbot JSON Format
+--------------------
+The base structure is a list of masters, which has a few properties and a list
+of builder groups, which in turn also have some properties and a list of bots
+(which in Buildbot terms are actually called "Builders" but we ended up calling
+them bots here).
-HTML Page
----------
-
-For now, there's only HTML output, but there's nothing stopping we to develop
-more forms of communication (email, IRC bots, etc).
-
-The HTML page is separated into blocks: Masters, Builder Groups, Bots. It also
-has a date on the top, to make sure you're looking at an up-to-date page, and
-it changes the page icon from green to red if at least one (non-ignored) bot
-is broken.
-
-Bots offline are considered broken, as they may require attention. But when the
-admin restarts the master, that kills all buildslaves, and this show up as
-"slave lost". You don't need to do anything, just wait for the next successful
-build.
-
-Each buildbot has four columns:
+Master properties:
- * Name & link: The bot name with a link to its page on its master. Good for
- easy access to buildbots and masters.
- * Status: Can only be "PASS" or "FAIL", but contains additional information
- if it fails, ex. "slave lost" or "build stage 1" or "test-suite". These are
- the name of the stages that failed.
- * Time: the total time spent in build reported as HH:MM:SS.
- * Build number: The build number, to help identify if there is a change from
- a specific number. Not very useful, but there just for reference.
- * Commit range: The range of commits that were tested on that build. This is
- very helpful to identify if a slow bot is failing because it hasn't yet
- reached the commit range on a fast bot that is passing, or not.
- * Comments: The reported status string in the case of a failure.
+ "name": Name of the master, which will appear as the section title.
+ "base_url": The base URL of the master, which will be used to make API calls.
+ For example for LLVM this might be "https://lab.llvm.org/buildbot".
+ "builder_url": The part of the URL that refers to the list of builders.
+ Will be added to base_url when making API calls.
+ "build_url": Part of the URL that refers to the list of builds. Added to base_url
+ when making API calls.
+ "ignore" : Set to "true" to hide the master from the page.
+ "builders": [ ...a list of builder groups as detailed below... ]
+Builder group properties:
-LLVM Masters
-------------
+ "name": Name of this group. "fast bots", "self-hosting", etc. Used as the section title.
+ "ignore" : Set to "true" to hide this builder group from the page.
+ "bots": [ ...a list of bots as described below... ]
-There are a number of masters in the LLVM upstream infrastructure, and we may
-need to monitor bots in all of those, or switch between them, depending on the
-need.
+Bot properties:
-* LLVM Upstream main master: http://lab.llvm.org:8011/
+ "name": The exact name of the buildbot. This will be used to build URLs for API calls.
+ "ignore": Set to "true" to ignore the status of this bot.
-This is the main master that spams everyone every time one of the bots break.
-Unless there is any specific concern, bots should be in this master.
+Notes on bots:
+ * Bots may be repeated across builder groups if they fall into multiple categories
+ (this does not slow down the monitor as results are cached).
+ * The same bot name on 2 different masters refers to 2 different bots.
-* LLVM Upstream silent master: http://lab.llvm.org:8014/
+Note that "ignore" has two different behaviours:
-Exactly the same as above, but no emails are sent. This master is usually empty
-except for the bots that may be noise temporarily, in active development, or
-being a bot that doesn't track compiler regressions, but performance regressions
-which is monitored on another page (http://llvm.org/perf/)
+ * On masters and builder groups, it omits the entire section from the output.
+ * On bots, it still shows the bot but ignores its status, meaning that an ignored bot
+ failing does not make the overall page status failed.
-* LLVM Japan master: http://bb.pgr.jp/
+Buildkite JSON Format
+---------------------
-A side master built by Nakamura Takumi with some x86 and x86_64 buildbots. We
-rarely need to monitor anything there, but it's good to know it's there.
+The Buildkite format follows the Buildbot format closely with some differences.
-* Linaro Downstream master: http://buildmaster.tcwglab.linaro.org/
+ * Since Buildkite is a centralised service the "name" is always "Buildkite".
+ * The "base_url" is always "https://www.buildkite.com".
+ * There is a new key "buildkite_org". This is used to find our particular bots
+ via the API.
+ * The "builder_url" and "build_url" keys are used to form clickable links,
+ and are always "builders" and "builds".
+ * Bots have an extra key "buildkite_pipeline" (explained below).
-Our local master, that we use for development. Individual developers can have
-their own containers, in which case, the masters will be in different ports.
+When querying Buildkite we make a request like this:
+"show me the results of the pipeline <buildkite_pipeline> in the organisation <buildkite_org>"
-These bots should always be ignored for their global status, or we'll generate
-a lot of noise to ourselves. Unless, of course, they're in their way upstream
-and going through staging deployment.
+From there the script looks for the bot's name in the last finished build. After that
+all the processing is the same as the Buildbot results.
-* Green Dragon bots: http://lab.llvm.org:8080/green/
+HTML Page
+---------
-This is not a buildbot master, but Jenkins. We don't monitor those in our page
-but they do have IRC bots in the #llvm channel and are already quite good at
-displaying success and failures.
+The script will generate an HTML page. This page is separated into blocks:
+ * Masters which contain...
+ * Builder Groups which contain...
+ * Bots
+
+The date is printed at the top of the page so you know when the results were generated.
+The favicon will change from green to red when at least one bot that has not itself
+been ignored has failed.
+
+Bots that are offline, or that cannot be fully read via the API, will show up with a
+message along the lines of "<bot name> is offline!". The page should still update
+correctly for the rest of the bots.
+
+Each listed bot has these columns:
+
+ * "Buildbot": This shows the name and a link to the master's web interface for the bot.
+ * "Status": The status of the last finished build. PASS or FAIL (currently cancelled
+ is also treated as a failure).
+ * "T Since": The time since the last build finished. This is useful for spotting bots
+ that have gotten disconnected. If this time is greater than 24 hours, it will be shown
+ in red.
+ * "Duration": The length of the last build.
+ * "Build #": The build number of the last finished build, which itself will be a link
+ to the results page for that build.
+ * "Commits": The commit range (if known) of the build, if that build was a failure.
+ * "Failing steps": The failed build steps, if it was a failed build.
+
+Note: "finished" here refers to the build ending, whether by success, cancellation or
+failure.
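
The following is not part of the patch. It is a minimal Python sketch of how a
config in the Buildbot JSON format described in the new README might be read, and
how the two "ignore" behaviours could be applied. The group and bot names, the
"builder_url" and "build_url" values, the is_ignored/visible_bots/page_failed
helpers and the statuses mapping are all hypothetical. The README writes "true"
in quotes, so the sketch accepts both a JSON boolean and the string "true"; check
the real script for what it actually expects.

```python
import json

# Hypothetical config following the Buildbot JSON format described in the README.
# Bot and group names are made up; the URL parts are assumed values.
CONFIG_JSON = """
[
  {
    "name": "LLVM",
    "base_url": "https://lab.llvm.org/buildbot",
    "builder_url": "builders",
    "build_url": "builds",
    "ignore": false,
    "builders": [
      {
        "name": "fast bots",
        "ignore": false,
        "bots": [
          {"name": "example-bot-a", "ignore": false},
          {"name": "example-bot-b", "ignore": true}
        ]
      },
      {
        "name": "self-hosting",
        "ignore": true,
        "bots": [{"name": "example-bot-c", "ignore": false}]
      }
    ]
  }
]
"""


def is_ignored(entry):
    """Treat both JSON true and the string "true" as ignored (an assumption)."""
    value = entry.get("ignore", False)
    return value is True or str(value).lower() == "true"


def visible_bots(masters):
    """Yield (master name, group name, bot) for every bot shown on the page.

    Ignored masters and ignored builder groups are omitted entirely.
    """
    for master in masters:
        if is_ignored(master):
            continue
        for group in master["builders"]:
            if is_ignored(group):
                continue
            for bot in group["bots"]:
                yield master["name"], group["name"], bot


def page_failed(masters, statuses):
    """Return True if any shown, non-ignored bot has status "FAIL".

    statuses maps (master name, bot name) to "PASS" or "FAIL". Keying on the
    master as well reflects that the same bot name on two masters means two bots.
    """
    for master_name, _group_name, bot in visible_bots(masters):
        if is_ignored(bot):
            continue  # still listed on the page, but its status does not count
        if statuses.get((master_name, bot["name"])) == "FAIL":
            return True
    return False


masters = json.loads(CONFIG_JSON)
```

With this config, visible_bots(masters) lists example-bot-a and example-bot-b but
not example-bot-c (its group is ignored), and a failure of example-bot-b alone
would not make page_failed return True.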