We use the open-source crawler Heritrix to crawl the web and the Hadoop Distributed File System (HDFS) to store the data. Our Java-based crawler is programmed to crawl politely and to honor robots.txt. Please send us an email if our spider is overloading your server.
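For site owners curious about what honoring robots.txt involves, here is a minimal sketch using Python's standard `urllib.robotparser`. It is not Heritrix code and does not reflect our crawler's actual configuration. The user agent `example-bot`, the `example.com` URLs, and the rules shown are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, as a site owner might publish it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making any network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "example-bot" is a placeholder user agent, not our spider's real name.
allowed = parser.can_fetch("example-bot", "http://example.com/index.html")
blocked = parser.can_fetch("example-bot", "http://example.com/private/page.html")
delay = parser.crawl_delay("example-bot")  # seconds to wait between requests

print(allowed, blocked, delay)
```

A polite crawler would skip any URL for which `can_fetch` returns `False` and wait at least `delay` seconds between requests to the same host.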