Linaro web logs processing scripts
==================================

Installation
------------

    apt-get install python-bsddb3
    apt-get install awffull
    apt-get install webalizer
    apt-get install webdruid
    apt-get install visitors
    apt-get install dnshistory

Get the ip2location Python module from
http://www.ip2location.com/developers/python (or, preferably, via
"pip install ip2location").

Get the ip2location IP-Country-Region-City-ISP Database ("DB4") from
http://www.ip2location.com/databases/db4-ip-country-region-city-isp .

Note: ip2location databases are commercial products. For testing, the
free sample DB4 database can be used:
http://www.ip2location.com/downloads/sample.bin.db4.zip

Dependencies
------------

The "dnshistory" tool (http://www.stedee.id.au/navigation/dnshistory) is
used for reverse DNS lookups and for storing the results in a Berkeley DB
database.

"webalizer", its forks "webdruid" and "awffull", and "visitors" are the
apps that actually generate the various stats from the processed logs.

Processing Workflow
-------------------

1. Gzipped daily Apache logs for the current year are fetched from
   various systems using rsync.
2. For previous years, there is one big preprocessed gzipped log.
3. This year's logs are ungzipped and concatenated into one log.
4. The dnshistory tool is run in "dolookups" mode: IPs from this year's
   log are resolved and stored in the database, but the log itself is
   not updated.
5. iploc.py is run on this year's log to perform name resolution (using
   the dnshistory database filled in the previous step) and geoip
   matching (using the ip2location database) at the same time. Some
   geoip information is stored in the ident/userid fields of the Apache
   logs. This may break tools which expect only limited-format values
   in those fields (such as a static "-"). awffull is one example of a
   tool that breaks this way.
6. The processed log for this year and the log for previous years are
   concatenated into a single log. For some hosts, this complete log
   may be filtered to produce focused reports (e.g. toolchain
   downloads).
7. The complete and filtered logs are run through the configured set of
   web log processing tools: webalizer, webdruid, visitors.
8. Reports of stats tools are output directly to the location

Known Issues
------------

Generally, only IPv4 is supported because of limitations in the
processing software (e.g. dnshistory supports only IPv4).
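
To illustrate step 5 of the processing workflow, here is a minimal sketch of
how a log line's host and ident/userid fields could be rewritten. It is not
the actual iploc.py code. The function name annotate_line is hypothetical, the
plain dicts stand in for the dnshistory and ip2location databases, and putting
the country in ident and the city in userid is an assumption made only for
illustration.

```python
import re

# Start of an Apache Common/Combined Log Format line:
#   host ident userid [timestamp] "request" status size ...
LOG_PREFIX = re.compile(r'^(\S+) (\S+) (\S+) (.*)$')


def annotate_line(line, hostnames, geo):
    """Return the log line with a resolved host name and geo data.

    hostnames: dict mapping IP -> host name (stand-in for the DNS database).
    geo:       dict mapping IP -> (country, city) (stand-in for the geoip
               database). The country/city choice is an assumption.
    """
    line = line.rstrip('\n')
    m = LOG_PREFIX.match(line)
    if m is None:
        return line  # leave unparseable lines untouched
    ip, ident, userid, rest = m.groups()
    host = hostnames.get(ip, ip)  # fall back to the raw IP
    country, city = geo.get(ip, (ident, userid))  # fall back to original fields
    # Log fields are space-separated, so values must not contain spaces.
    country = country.replace(' ', '_')
    city = city.replace(' ', '_')
    return '%s %s %s %s' % (host, country, city, rest)


if __name__ == '__main__':
    sample = ('192.0.2.10 - - [10/Oct/2013:13:55:36 +0000] '
              '"GET /index.html HTTP/1.1" 200 2326')
    print(annotate_line(sample,
                        {'192.0.2.10': 'host.example.org'},
                        {'192.0.2.10': ('GB', 'Cambridge')}))
```

The sketch also shows why some tools break on the processed log: after
rewriting, the second and third fields are no longer the static "-" that such
tools expect.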