Compiling and running Crawler4j
http://code.google.com/p/crawler4j/

INSTALL UBUNTU

Use a spare partition or VirtualBox:
    http://www.virtualbox.org
    http://www.ubuntu.com
Use version 10.04 LTS.
You can install any packages you need or like using System > Administration > Synaptic Package Manager.

INSTALL SUN/ORACLE JAVA FOR UBUNTU

Add the partner repository using the following command:
    sudo add-apt-repository "deb http://archive.canonical.com/ lucid partner"
Update the source list:
    sudo apt-get update
Now install the Sun Java packages using the following command:
    sudo apt-get install sun-java6-jre sun-java6-plugin sun-java6-fonts
Note: javac (used in step 4 below) is part of the JDK, not the JRE. If only the JRE is installed, also install sun-java6-jdk.

Alternatively:
    sudo apt-get install python-software-properties
    sudo add-apt-repository ppa:ferramroberto/java
    sudo apt-get update
    sudo apt-get install sun-java6-jdk sun-java6-plugin

Crawler4j

1. Make a directory named jars and copy all needed jars there (see the downloads on the crawler4j page).
2. Set the classpath to include all jars and the current directory:
    export CLASSPATH="jars/crawler4j-3.3.jar:jars/apache-mime4j-core-0.7.jar:jars/geronimo-stax-api_1.0_spec-1.0.1.jar:jars/apache-mime4j-dom-0.7.jar:jars/httpclient-4.1.2.jar:jars/asm-3.1.jar:jars/httpcore-4.1.4.jar:jars/boilerpipe-1.1.0.jar:jars/je-4.0.92.jar:jars/commons-codec-1.5.jar:jars/log4j-1.2.14.jar:jars/commons-compress-1.3.jar:jars/commons-logging-1.1.1.jar:jars/metadata-extractor-2.4.0-beta-1.jar:jars/tagsoup-1.2.1.jar:jars/tika-core-1.0.jar:jars/tika-parsers-1.0.jar:."
3. Get the basic crawler files (BasicCrawler.java and BasicCrawlController.java) and edit them (delete the package statement, etc.).
4. Compile:
    javac BasicCrawler.java
    javac BasicCrawlController.java
5. Start the crawler:
    java BasicCrawlController data 1
   Here data is the folder that will hold the intermediate crawl data and 1 is the number of crawler threads.
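
Typing the long CLASSPATH by hand is easy to get wrong, so the value can also be generated from the contents of the jars directory. The following is only a minimal sketch, assuming all needed jars sit directly in jars/; build_classpath is a hypothetical helper name and not part of crawler4j.

```shell
#!/bin/sh
# Build a colon-separated classpath from every .jar file in a directory.
# build_classpath is a hypothetical helper, not something crawler4j ships.
build_classpath() {
    jar_dir="$1"
    result=""
    for jar in "$jar_dir"/*.jar; do
        # If the directory has no jars the pattern stays unexpanded; skip it.
        [ -e "$jar" ] || continue
        result="${result}${jar}:"
    done
    # End with "." so the compiled .class files in the current directory are found.
    printf '%s.\n' "$result"
}

# Run from the directory that contains jars/
CLASSPATH="$(build_classpath jars)"
export CLASSPATH
```

After running this in the current shell (for example by sourcing the file), the javac and java commands from steps 4 and 5 can be used unchanged. Java 6 also accepts a wildcard entry in the class path, so export CLASSPATH="jars/*:." should be an equivalent shortcut.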