Adding compression support can be very simple. If your spider is written in Perl using LWP::UserAgent, then a single line of code enables it:

    $ua->default_header('Accept-Encoding' => 'gzip');

You then need to make sure that you always use 'decoded_content' (rather than 'content') when dealing with the response object.
For other languages, all you need to do is add

    Accept-Encoding: gzip

to the HTTP request that you send, and then be prepared to deal with a 'Content-Encoding: gzip' header in the response.
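As an illustration of the "other languages" case, here is a minimal sketch in Python using only the standard library. The function names `decode_body` and `fetch` are made up for this example, and it assumes the server replies with either a gzip-encoded or an unencoded body; a real crawler would also want timeouts and error handling.

```python
import gzip
import urllib.request
from typing import Optional


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Undo gzip content-encoding if the response declared it."""
    if content_encoding and content_encoding.strip().lower() in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    # No (or an unrecognised) encoding: hand the bytes back untouched.
    return body


def fetch(url: str) -> bytes:
    """Request a URL while advertising gzip support, and return the decoded body."""
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    with urllib.request.urlopen(request) as response:
        # urllib does not decompress for us, so check the header ourselves.
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

Keeping the decoding step separate from the network call means the same check can be applied to any response, whichever HTTP library fetched it.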
Happily, some of the large spiders do support compression -- Googlebot and Yahoo Slurp, to name but two. Since I started prodding crawler implementors, a couple have implemented compression (one within hours), and another reported that the lack of compression support was a bug which would be fixed shortly.
Crawlers which do more than 5% of the total (uncompressed) crawling activity are marked in bold below.
Crawler | Host | Last IP used |
---|---|---|
Aranea Web-Crawled Corpora Project (+http://aranea.juls.savba.sk/guest (English 2024 Spring Crawl)) | blog.gladstonefamily.net | 147.213.138.57 |
Aranea Web-Crawled Corpora Project (+http://aranea.juls.savba.sk/guest (English 2024 Spring Crawl)) | pond1.gladstonefamily.net | 147.213.138.57 |
curl/7.54.0 | c-73-227-75-114.hsd1.ma.comcast.net | 139.144.52.241 |
curl/7.54.0 | c-73-227-75-114.hsd1.ma.comcast.net:8080 | 139.144.52.241 |
DomainStatsBot/1.0 (https://domainstats.com/pages/our-bot) | gladstonefamily.net | 148.251.121.91 |
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php) | pond.gladstonefamily.net | 173.252.87.13 |
fasthttp | | 80.82.77.202 |
Magellan | gladstonefamily.net | 172.91.101.96 |
masscan/1.0 (https://github.com/robertdavidgraham/masscan) | - | 212.70.149.134 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) | blog1.gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) | gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) | pond.gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; help@moz.com) | www.gladstonefamily.net | 216.244.66.194 |
Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com) | pond.gladstonefamily.net | 65.108.74.120 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.5060.134 Safari/537.36 | | 138.246.253.24 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | blog.gladstonefamily.net | 3.14.253.221 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | blog1.gladstonefamily.net | 3.15.229.113 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | gladstonefamily.net | 18.119.107.96 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | pond.gladstonefamily.net | 54.242.220.142 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | pond1.gladstonefamily.net | 3.16.66.206 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com) | www.gladstonefamily.net | 18.224.37.68 |
VsuSearchSpider/1.0 | | 87.153.109.113 |