Linaro web logs processing scripts
==================================

Installation
------------

apt-get install python-bsddb3
apt-get install awffull
apt-get install webalizer
apt-get install webdruid
apt-get install visitors
apt-get install dnshistory

Get the ip2location Python module from http://www.ip2location.com/developers/python
(or, preferably, via "pip install ip2location").

Get the ip2location IP-Country-Region-City-ISP database ("DB4") from
http://www.ip2location.com/databases/db4-ip-country-region-city-isp .
Note: ip2location databases are commercial products. For testing, a free
sample DB4 database can be used:
http://www.ip2location.com/downloads/sample.bin.db4.zip

Dependencies
------------

The "dnshistory" tool (http://www.stedee.id.au/navigation/dnshistory)
is used for reverse DNS lookups and stores the results in a Berkeley DB
database.

"webalizer", its forks "webdruid" and "awffull", and "visitors" are the
apps that generate the actual statistics from the processed logs.

Processing Workflow
-------------------

1. Gzipped daily Apache logs for the current year are fetched from various
systems using rsync.
2. For previous years, there is one big preprocessed gzipped log.
3. This year's logs are ungzipped and concatenated into one log.
4. The dnshistory tool is run in "dolookups" mode: IPs from this year's log
are resolved and stored in its database, but the log itself is not updated.
5. iploc.py is run on this year's log to perform name resolution (using
the DNS database filled in by the previous step) and geoip matching (using
the ip2location database) at the same time. Some geoip information is
stored in the ident/userid fields of the Apache log. This may break tools
that expect only limited-format values in these fields (such as a static "-").
One example of such a fragile tool is awffull.
6. This year's processed log and the log for previous years are concatenated
into a single log. For some hosts, this complete log may be filtered to
produce focused reports (e.g. toolchain downloads).
7. The complete and filtered logs are run through the configured set of web
log processing tools: webalizer, webdruid, visitors.
8. Reports of stats tools are output directly to the location 
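
The rewriting done in step 5 can be illustrated with a minimal Python sketch.
This is not the actual iploc.py. The function names, the choice of putting
the country into the ident field and the city into the userid field, and the
two lookup callables (standing in for the dnshistory and ip2location
databases) are all assumptions made for illustration.

```python
def encode_field(value):
    """Make a value safe for a space-separated log field.

    Whitespace is replaced with "_" and an empty value becomes "-",
    the usual Apache placeholder for "no data".
    """
    return "_".join(value.split()) or "-"


def annotate_line(line, resolve_host, geo_lookup):
    """Rewrite one Apache common/combined log line.

    resolve_host(ip) -> hostname or None   (hypothetical DNS cache lookup)
    geo_lookup(ip)   -> (country, city)    (hypothetical geoip lookup)

    The first three fields of the line are: host, ident, userid.
    Lines that do not have this shape are returned unchanged.
    """
    parts = line.split(" ", 3)
    if len(parts) < 4:
        return line
    ip, _ident, _userid, rest = parts
    host = resolve_host(ip) or ip          # fall back to the raw IP
    country, city = geo_lookup(ip)
    # Assumption: country goes to ident, city goes to userid.
    return " ".join([host, encode_field(country), encode_field(city), rest])
```

With such a layout, a line starting with "192.0.2.1 - -" might come out
starting with "host.example.org US New_York". Any tool that insists on a
literal "-" in the second and third fields would then reject the line, which
is the kind of breakage described in step 5.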

Known Issues
------------

Generally, only IPv4 is supported because of limitations in the
processing software (e.g. dnshistory supports only IPv4).
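
The scripts' own handling of IPv6 entries is not described here. As a purely
illustrative sketch, assuming Python 3 and log lines whose first field is the
client address, lines with a non-IPv4 client could be separated out before
the IPv4-only tools see them. The function names are hypothetical.

```python
import ipaddress


def is_ipv4_line(line):
    """Return True if the first space-separated field is an IPv4 address."""
    host = line.split(" ", 1)[0]
    try:
        ipaddress.IPv4Address(host)
    except ValueError:
        # Raised for IPv6 addresses, hostnames and empty fields alike.
        return False
    return True


def split_by_family(lines):
    """Split log lines into (ipv4_lines, other_lines), preserving order."""
    ipv4, other = [], []
    for line in lines:
        (ipv4 if is_ipv4_line(line) else other).append(line)
    return ipv4, other
```

Keeping the non-IPv4 lines in a separate list, rather than dropping them,
makes it possible to count how much traffic the reports leave out.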