In the “Reconnaissance” portion of The Basics of Hacking and Penetration Testing, Engebretson discusses various methods of collecting data about the target of a penetration test. If stealth is one of your objectives, then you want to use as many passive (and as few active) reconnaissance methods as possible. Enter HTTrack, which allows you to make a page-by-page copy of a website that you can then browse offline.
The copied website will be identical to the real one, but will exist only on your local machine (meaning fewer opportunities to be tracked or detected). This allows you to browse through the website as long or as often as you want without tipping off the company’s server.
You might also want to archive a site or browse on a slow internet connection; HTTrack is good for those use cases as well.
Sounds good. How do I get HTTrack?
If you are running Kali Linux, then HTTrack comes pre-installed.
If you’re running another version of Linux, you can type apt-get install httrack.
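For example, on a Debian-based distribution (run as root, or prefix the commands with sudo):
apt-get update
apt-get install httrack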
If you’re running another OS (Windows, Mac, etc.), you can find installation packages on the downloads page of HTTrack’s website.
Site-copying example using HTTrack
First, please read HTTrack’s guide on what to do and, more importantly, what not to do, so as not to abuse bandwidth or violate copyright laws. It would be wise to use HTTrack only on websites that you have permission to copy.
You can use httrack as a one-line command (with command-line arguments), or you can work through its interactive prompt-based guide. To start, type:
httrack
You will see a message welcoming you (for help options, type httrack --help). You’ll be asked to enter a project name:
Enter project name: Blog Example
Base path is where the program will store the copied website. Hitting return or enter will use the default location of `/root/websites/`.
Base path (return=/root/websites/): [hit return or enter a new location]
Next, you’ll need to enter the URL(s) you intend to copy.
Enter URLs (separated by commas or blank spaces): https://jaimelightfoot.com
You’ll then be shown a list of actions (plus an option 0 to quit):
- Mirror Web Site(s)
- Mirror Web Site(s) with Wizard
- Just Get Files Indicated
- Mirror ALL links in URLs (Multiple Mirror)
- Test Links in URLs (Bookmark Test)
You will be asked for a proxy (or hit return for no proxy). From the user guide:
Many users use a proxy for many of their functions. This is a key component in many firewalls, but it is also commonly used for anonymizing access and for exploiting higher speed communications at a remote server.
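If you do use a proxy, the command-line equivalent is the -P flag; here’s a sketch with a placeholder proxy host and port:
httrack https://jaimelightfoot.com -O /root/websites/blog -P proxy.example.com:8080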
Next, you can define wildcards to filter the results you want: ‘+’ accepts matching links and ‘-’ avoids them. For example: -*.gif +www.*.com/*.zip -*img_*.zip
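Filters can also be passed as trailing arguments in one-line mode; a sketch (the output path here is arbitrary):
httrack https://jaimelightfoot.com -O /root/websites/blog "+*.jaimelightfoot.com/*" "-*.zip"
This would keep the mirror within jaimelightfoot.com and skip any zip archives.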
Lastly, you can specify additional options (a command-line sketch using a few of them follows this list). These include:
- Limits options (to limit depth, transfer rate, mirror time, overall size, etc.)
- Flow control (handling timeouts, number of connections, number of retries, etc.)
- Links options (link parsing and URL testing)
- Build options (structure type, replacing external links, etc.)
- Spider options (accept cookies, follow robots.txt, keep-alive, etc.)
- Browser ID (sending user-agent or default referer HTTP header fields, specify other HTTP headers)
- Expert options (priority mode, more scanning options)
- Guru options (“do NOT use if possible”)
- Dangerous options ([all caps yelling not to use])
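As promised, here’s a sketch showing a few of these options as command-line flags (drawn from HTTrack’s man page; verify against httrack --help for your version):
httrack https://jaimelightfoot.com -O /root/websites/blog -r3 -c4 -A25000 -F "Mozilla/5.0 (compatible)"
Here -r3 limits the mirror depth to three levels, -c4 caps simultaneous connections at four, -A25000 limits the transfer rate to 25,000 bytes per second, and -F sets the user-agent header.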
This will then spit out a one-line command-line equivalent of what you’re asking for, and ask if you’re ready to begin (Y/n). Type “Y” and it will begin the mirror.
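With the example answers above, the generated command might look something like this (exact flags will vary with your answers; -W is semi-automatic wizard mode, -O sets the base path, and -%v prints filenames as they download):
httrack https://jaimelightfoot.com -W -O "/root/websites/Blog Example" -%v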
Open up the links in a browser (type firefox to launch Firefox) and browse your site. If it is not as expected, look at the hts-log.txt log in the website copy directory to debug.
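Assuming the default base path and the project name from this walkthrough, that might look like:
firefox "/root/websites/Blog Example/index.html"
cat "/root/websites/Blog Example/hts-log.txt"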
A full user’s guide (man page style) can be found on HTTrack’s website.
Any gotchas?
HTTrack has a few weak spots. Other users have reported that it struggles with PHP-style links. Additionally, HTTrack’s FAQ describes some known cases that won’t work:
- Flash sites – no full support
- Intensive Java/Javascript sites – might be bogus/incomplete
- Complex CGI with built-in redirect, and other tricks – very complicated to handle, and therefore might cause problems
- Parsing problems in the HTML code (cases where the engine is fooled, for example by a false comment (<!--) with no closing comment (-->) detected). Rare cases, but might occur. A bug report is then generally good!
As always, be careful about what you have permission to do. Happy copying!