This time I have to deal with another kind of leech … he he he. Some user has been hogging my server bandwidth for quite a while now, and this abuse couldn’t continue. Here are some tricks for denying bandwidth-sucking users by detecting their browser user agent. Fortunately, Apache has a built-in directive called BrowserMatch, along with its sibling BrowserMatchNoCase. These allow you to block clients based on their User-Agent string, and they work without the mod_rewrite module. Yeahh … I know … the User-Agent string itself can be spoofed, and appearances can be very deceiving 🙂 … but, back to the topic, this will take care of the vast majority of your would-be bandwidth hogs.
Looking for web spiders and site suckers:
Here’s a shortcut for identifying all user agents in your logs.
{code}cat access.log | awk -F '"' '{print $6}' | sort | uniq | grep -v Mozilla{/code}
Translation: run the log file through a filter, treating double quotes as field separators, pick out the 6th field (which is the User-Agent field), sort the resulting list, toss out all duplicates, and don’t show me anything containing “Mozilla”. NOTE that many programs include “Mozilla” in their user-agent strings; also, some download accelerators operate as plugins to the regular browser and append their user-agent specifics to the browser’s string. So, if you want to be more thorough than this, leave off everything after “uniq” and you’ll see it all, including stuff like:
Mozilla/5.0 (Windows; U; Windows NT 5.1; vi; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10 ;ShopperReports
Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.13) Gecko/20100914 Ant.com Toolbar 2.0.1 Firefox/3.5.13
which may or may not be legit, but are definitely unusual.
This will take a while for large logs. I filter out “Mozilla” since it is what appears for most normal browsers, including IE.
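If you also want to see which agents are hammering you hardest, not just which ones exist, a small variation counts requests per agent; a rough sketch, assuming the same combined log format (the log path is just an example):
{code}# requests per user agent, busiest first
awk -F '"' '{print $6}' /var/log/httpd/access.log | sort | uniq -c | sort -rn | head -20{/code}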
Also, this assumes you are using the “combined” log format. If you don’t get the results you expect (see below), try replacing the $6 with another number; your log format may order the fields differently.
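For reference, here is the stock “combined” format definition, plus a made-up sample line showing why splitting on double quotes puts the user agent in field 6:
{code}# the stock "combined" format definition (from httpd.conf):
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
# a made-up sample line; split on double quotes, the user agent is field 6:
# 1.2.3.4 - - [10/Oct/2010:13:55:36 -0700] "GET / HTTP/1.1" 200 2326 "-" "Wget/1.12 (linux-gnu)"{/code}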
Sample results:
{code}cat /var/log/httpd/access.log | awk -F '"' '{print $6}' | sort | uniq | grep -v Mozilla{/code}
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=17527023618099803589)
Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; feed-id=7166567229941680435)
Feedjit Favicon Crawler 1.0
Firefox (SBUA)
FlashGet
GetGo Download Manager 4.0 (www.getgosoft.com)Pragma: no-cache
Googlebot-Image/1.0
Googlebot/2.1 (+http://www.google.com/bot.html)
Axel 1.0b (Linux)
BaiduImagespider+(+http://www.baidu.jp/spider/)
Baiduspider+(+http://www.baidu.com/search/spider.htm)
Baiduspider+(+http://www.baidu.jp/spider/)
BlackBerry8520/4.6.1.314 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/123
BlackBerry8520/4.6.1.314 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/611
BlackBerry8520/4.6.1.314 Profile/MIDP-2.0 Configuration/CLDC-1.1 VendorID/613
IRLbot/1.0 (+http://irl.cs.tamu.edu/crawler)
Iltrovatore-Setaccio/1.2 (It-bot; http://www.iltrovatore.it/bot.html; [email protected])
Java1.3.1
LWP::Simple/5.64
LeechGet 2004 (www.leechget.net)
Links (2.1pre15; Linux 2.6.7-hardened-r16 i686; 80×40)
Lynx/2.8.4rel.1 libwww-FM/2.14
Mediapartners-Google/2.1
Monica/1.4
Opera/7.50 (X11; Linux i386; U) [en]
RealDownload/4.0.0.40
SIE-M55/10 UP.Browser/6.1.0.5.c.6 (GUI) MMP/1.0 (Google WAP Proxy/1.0)
SafariBookmarkChecker/1.26 (+http://www.coriolis.ch/)
SiteBar/3.2.6
Opera/9.80 (S60; SymbOS; Opera Mobi/499; U; en) Presto/2.4.18 Version/10.00
Opera/9.80 (Windows Mobile; Opera Mini/5.1.21594/20.2497; U; en) Presto/2.5.25
Wget/1.10.2 (Red Hat modified)
Wget/1.11.4
Wget/1.12 (linux-gnu)
MWG Atom Life Profile/MIDP-2.0 Configuration/CLDC-1.1 UNTRUSTED/1.0
MxAgent
NetScape/0.02 [fu] (MAC OS; X; SK)
Nokia5300/2.0 (05.51) Profile/MIDP-2.0 Configuration/CLDC-1.1
Nokia6070/2.0 (03.20) Profile/MIDP-2.0 Configuration/CLDC-1.1
Nokia6085/2.0 (03.71) Profile/MIDP-2.0 Configuration/CLDC-1.1
Nokia6235/1.0 (S190V0200.nep) UP.Browser/6.2.3.2 MMP/2.0
iCab/2.9.8 (Macintosh; U; 68K)
SonyEricssonG502/R1FA Browser/NetFront/3.4 Profile/MIDP-2.1 Configuration/CLDC-1.1
SonyEricssonK510i/R4EA Browser/NetFront/3.3 Profile/MIDP-2.0 Configuration/CLDC-1.1
SonyEricssonW380i/R10CA Browser/NetFront/3.3 Profile/MIDP-2.0 Configuration/CLDC-1.1
SonyEricssonX2/R3AA Browser/Mozilla/4.0 (compatible; MSIE 6.0; Windows CE; IEMobile 8.12; MSIEMobile 6.5) Profile/MIDP-2.1 Configuration/CLDC-1.1
Sosospider+(+http://help.soso.com/webspider.htm)
StackRambler/2.0 (MSIE incompatible)
msnbot/1.0 (+http://search.msn.com/msnbot.htm)
Windows-Media-Player/10.00.00.3993
Windows-Media-Player/11.0.6000.6352
WordPress/3.0.1; http://trik-fb.co.cc
WordPress/3.0; http://www.ahlidesain.com
Xunlei@Http Download
YahooMobile/1.0 (Resource; Server; 1.0.0)
Yeti/1.0 (NHN Corp.; http://help.naver.com/robots/)
When you see something odd or suspicious (like “LeechGet”; gee, I wonder what that does), Google around for it. If the name is too general, add “user agent” or “spider” or “search engine” to your query.
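Before blocking anything, it can also help to check how many requests a suspect agent actually made and what it pulled. A rough sketch, using “LeechGet” and my log path purely as examples:
{code}# what did this agent actually fetch? most-requested URLs first
grep -i 'LeechGet' /var/log/httpd/access.log | awk '{print $7}' | sort | uniq -c | sort -rn | head{/code}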
Now it’s time to apply this knowledge:
1. In apache2.conf, add the following:
{code}BrowserMatchNoCase ^NameOfBadProgram1 nameofenv
BrowserMatchNoCase ^NameOfBadProgram2 nameofenv
BrowserMatchNoCase ^NameOfBadProgram3 nameofenv{/code}
Use the same “nameofenv” value for all of the agents you want to block. I added this section after some preexisting BrowserMatch directives that had to do with forcing HTTP responses for certain browser versions. Here’s what mine looks like right now; I will be adding to it as my logs reveal new twits:
{code}# anti-bandwidth-sucker measures by jfdesign added
BrowserMatchNoCase ^wget suckers
BrowserMatchNoCase ^SiteSucker suckers
BrowserMatchNoCase ^iGetter suckers
BrowserMatchNoCase ^larbin suckers
BrowserMatchNoCase ^LeechGet suckers
BrowserMatchNoCase ^RealDownload suckers
BrowserMatchNoCase ^Teleport suckers
BrowserMatchNoCase ^Webwhacker suckers
BrowserMatchNoCase ^WebDevil suckers
BrowserMatchNoCase ^Webzip suckers
BrowserMatchNoCase ^Attache suckers
BrowserMatchNoCase ^SiteSnagger suckers
BrowserMatchNoCase ^WX_mail suckers
BrowserMatchNoCase ^EmailCollector suckers
BrowserMatchNoCase ^WhoWhere suckers
BrowserMatchNoCase ^Roverbot suckers
BrowserMatchNoCase ^ActiveAgent suckers
BrowserMatchNoCase ^EmailSiphon suckers{/code}
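Note that the ^ anchors each pattern to the start of the User-Agent string, so an agent that buries its name mid-string (like the toolbar examples above) will slip through. If you would rather catch a name anywhere in the string, a single unanchored alternation should work too; a sketch, with the list trimmed to taste:
{code}# one unanchored, case-insensitive pattern instead of many anchored ones
BrowserMatchNoCase (wget|flashget|leechget|realdownload|teleport) suckers{/code}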
2. Now, inside the Directory blocks for each directory you want to apply this to, put:
{code}deny from env=suckers{/code}
You could also add this to individual directories’ .htaccess files, but I have not tried this yet. I simply added it at the end of my main Directory block:
{code}Options Indexes FollowSymLinks
AllowOverride AuthConfig
deny from env=suckers{/code}
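If the deny does not seem to take effect, being explicit about the evaluation order may help; a sketch of what the full block could look like, with a placeholder path:
{code}# path is just an example; use your own DocumentRoot or directory
<Directory /var/www/html>
    Options Indexes FollowSymLinks
    AllowOverride AuthConfig
    # evaluate Allow first, then Deny; anything flagged "suckers" gets a 403
    Order Allow,Deny
    Allow from all
    Deny from env=suckers
</Directory>{/code}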
You can create other environments for things like known email-harvesting bots, known-evil web spiders, etc., and do more creative things based on which type of malicious visitor they are. I am content to just refuse connections from all of them.
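For example, the same environment variable can drive conditional logging, so you can keep an eye on what the blocked clients keep trying; a sketch, with a made-up log path:
{code}# send requests from flagged agents to their own log file
CustomLog /var/log/httpd/suckers.log combined env=suckers{/code}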
Don’t forget to restart Apache (service apache2 restart) after applying any of these changes.
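Once it’s back up, you can verify the block by faking a blacklisted user agent with curl; the flagged agent should get a 403 while a normal-looking one sails through (the hostname is a placeholder):
{code}# pretending to be a blocked agent: expect HTTP/1.1 403 Forbidden
curl -I -A "Wget/1.12 (linux-gnu)" http://www.example.com/
# pretending to be a normal browser: expect HTTP/1.1 200 OK
curl -I -A "Mozilla/5.0" http://www.example.com/{/code}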