Offline web browsing

HTTrack automatically downloads a website’s entire content, or just a specific section of it, for future offline reference.

For telecentres with limited internet access, it can be useful to have a local copy of some websites. Web browsers let you save individual pages, but not an entire site. For that, you need specialised software. True offline navigation requires a mirror (copy) of the website on a hard drive, with the content and links preserved so that the original site’s structure is recreated.

HTTrack is a free and open source website copier. It is multilingual and compatible with Windows (WinHTTrack), Linux and OS X (WebHTTrack). The program is easy to install and use, fully customisable and actively supported by the developer. It can copy an entire website, or just a specific section of it, so that the pages remain available offline.

Copy a site 

Visit www.httrack.com, open the download page from the top menu and select the appropriate version of the program, based on your operating system.

The first time you launch the application, you can select your preferred language. You can change the language at any time on the ‘About WinHTTrack Website Copier’ page in the Help menu.

The program uses a wizard to guide you through the necessary steps for capturing and saving websites. On the first screen, you can type the name of a new project or select an existing project to update or resume. It is best to create one project per website or section of a website: name the project after the site you plan to download.

Categories are optional, but they are useful for grouping projects by subject matter. Create a new category or assign an existing one (the application will remember them) to each project you start. Then browse your computer’s hard drive and choose the folder where the project will be saved, e.g. C:\My Web Sites.

On the next screen, choose what action to execute from the options: copy a website, resume a download or update a mirror already on your hard drive. If you selected an existing project on the previous screen, the program automatically chooses the appropriate action.

To start a new project, add the URL of the website you want to copy. Websites contain a lot of content and files that may not be important to you. The HTTrack developers strongly recommend giving each project a URL that refers only to one section of a website. To download the archive of ICT Update’s past issues, for example, add the following web address: http://ictupdate.cta.int/Issues.
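HTTrack also has a command-line program, httrack, which takes the same URL as the wizard. The Python sketch below only assembles such a command and does not run it. It assumes that httrack is installed and on the system path; the project folder name ‘ictupdate-issues’ and the function build_copy_command are hypothetical names chosen for illustration.

```python
from urllib.parse import urlparse


def build_copy_command(url, output_dir):
    """Return an argument list for mirroring `url` into `output_dir`."""
    parsed = urlparse(url)
    # Reject anything that is not a full http(s) address.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a usable web address: {url!r}")
    # "-O" tells httrack where to store the mirror and its cache.
    return ["httrack", url, "-O", output_dir]


# Hypothetical project folder, following the article's example location.
command = build_copy_command(
    "http://ictupdate.cta.int/Issues",
    "C:\\My Web Sites\\ictupdate-issues",
)
print(command)
```

To start the download from Python, the list could be passed to subprocess.run(command, check=True). Pointing the URL at one section, as here, is what keeps the copy small.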
 
Below the text box showing the URL, the ‘Set options’ button opens a window with several tabs, each letting you set download parameters and mirror options. Here, you can use filters to skip large data files, such as pictures or PDFs, and set limits on bandwidth and maximum page size, among others.
 
The filters are essential to keep the size of the download to a minimum. In the ‘Set options’ screen, select the ‘Scan rules’ tab. Here you can specify the types of files not to download. If downloading ICT Update’s archive, for example, you can choose to exclude the pictures (with file extensions jpg, gif, png) and the PDF files.
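Scan rules are written as wildcard patterns, with a leading minus sign to exclude files and a plus sign to include them. The Python sketch below generates exclusion rules for the file types mentioned above and checks which addresses they would skip. The matching is only an approximation using Python’s own wildcard matching, not HTTrack’s rule engine, and the function names and sample addresses are hypothetical.

```python
from fnmatch import fnmatchcase


def exclusion_rules(extensions):
    """Build one '-*.ext' scan rule per file extension."""
    return [f"-*.{ext.lower().lstrip('.')}" for ext in extensions]


def would_skip(url, rules):
    """Approximate check: does any exclusion rule match this address?

    Simplification: rule ordering and '+' rules are ignored here.
    """
    return any(
        rule.startswith("-") and fnmatchcase(url.lower(), rule[1:])
        for rule in rules
    )


rules = exclusion_rules(["jpg", "gif", "png", "pdf"])
print(" ".join(rules))

# Hypothetical addresses, used only to illustrate the matching.
print(would_skip("http://ictupdate.cta.int/Issues/cover.JPG", rules))
print(would_skip("http://ictupdate.cta.int/Issues/index.html", rules))
```

The rules printed by the first line can be typed into the ‘Scan rules’ tab, separated by spaces or on separate lines.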

The FAQ section on the HTTrack website explains in detail how to set the different parameters and filters correctly, to avoid downloading the entire World Wide Web.

The final screen lets you define a remote connection if necessary and also set a timer to postpone the actual download. In most cases this will not be necessary, and you can leave these options blank and click ‘Finish’ to start the download. The downloading pane shows which files are being captured: if some are too slow to download, press the associated ‘Skip’ button to discard them.

The hard drive will typically need between 500MB and 1GB of free space for each site, or part of a site. Since downloading this much data takes a long time and uses a lot of bandwidth, it is a good idea to schedule the downloads for quieter times, or even overnight, to leave the internet connection free for users.
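Both checks are easy to automate before starting a download. The Python sketch below tests whether a drive has room for one mirror, using the 1GB upper estimate given above, and whether the current time falls in a quiet period. The quiet period of 22:00 to 06:00 is an assumption for illustration only; each telecentre will know its own busy hours.

```python
import shutil
from datetime import datetime, time

REQUIRED_BYTES = 1 * 1024**3  # 1GB, the upper estimate for one mirror


def has_room(free_bytes, required_bytes=REQUIRED_BYTES):
    """True when the free space covers the estimated mirror size."""
    return free_bytes >= required_bytes


def is_quiet_time(now, start=time(22, 0), end=time(6, 0)):
    """True when `now` is inside the quiet window (assumed 22:00-06:00).

    The window may run past midnight, so both cases are handled.
    """
    if start <= end:
        return start <= now < end
    return now >= start or now < end


# Check the drive holding the current folder and the current clock time.
free = shutil.disk_usage(".").free
print("Enough space for one mirror:", has_room(free))
print("Quiet time now:", is_quiet_time(datetime.now().time()))
```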

When the site has downloaded, open the folder where the copy was saved on the hard drive, and select the ‘index.html’ file to start using the mirrored site.
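On a shared computer, a small script can open the mirrored site for users so they do not have to find the folder themselves. The Python sketch below is one way to do this; the function names are hypothetical, and it assumes the project folder contains the ‘index.html’ file described above.

```python
import webbrowser
from pathlib import Path


def mirror_start_page(project_dir):
    """Return a file:// address for the mirror's index.html."""
    index = Path(project_dir) / "index.html"
    if not index.is_file():
        raise FileNotFoundError(f"no index.html in {project_dir}")
    return index.resolve().as_uri()


def open_mirror(project_dir):
    """Open the mirrored site in the default web browser."""
    return webbrowser.open(mirror_start_page(project_dir))


# Example call (hypothetical project folder):
# open_mirror("C:\\My Web Sites\\ictupdate-issues")
```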

Important information

Some websites may block automated website copiers like HTTrack for legal or technical reasons, for instance where the content is protected by copyright, or to prevent overuse of the original server’s bandwidth capacity. Wikipedia and YouTube, for example, do not allow website copiers. Wikipedia content can instead be downloaded from the following address: http://en.wikipedia.org/wiki/Wikipedia:Database_download.
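One common way for a site to signal that copiers are not welcome is a robots.txt file at the root of the site. The Python sketch below reads such a file with the standard library and reports whether a given address may be fetched. The sample robots.txt text, the example.org addresses and the agent name ‘HTTrack’ are assumptions for illustration; a real check would use the file published by the site in question.

```python
from urllib.robotparser import RobotFileParser


def copier_allowed(robots_txt, url, agent="HTTrack"):
    """Return True if the robots.txt text lets `agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt: blocks the copier, restricts everyone else
# only from the /private/ section.
SAMPLE_ROBOTS = """\
User-agent: HTTrack
Disallow: /

User-agent: *
Disallow: /private/
"""

print(copier_allowed(SAMPLE_ROBOTS, "http://example.org/Issues"))
print(copier_allowed(SAMPLE_ROBOTS, "http://example.org/Issues", agent="OtherBot"))
```

A robots.txt file covers only the technical side. It does not say whether copying is permitted under copyright.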

The FAQ section on HTTrack.com gives further information on when to notify a site’s administrator before you copy a website. 

---------------------------------------------------------

Related links

www.httrack.com

www.blogtechnika.com

http://betanews.com

13 June 2012

Copyright © 2014, CTA. Technical Centre for Agricultural and Rural Cooperation (ACP-EU)