Making an offline copy of a wiki

We use a Confluence wiki for one of the projects that I work on. Wikis can be a fantastic tool for collaboration, and this wiki is a single place where we can share information and our progress.

But we’ve been having problems with the reliability of the wiki – it is unavailable at times, and can be painfully slow at others. Key information that I need is in that wiki, so when it goes down, getting at that information becomes difficult and frustrating.

Yesterday, I had a play with wget to try and download an offline copy of the wiki to use as a backup for when it isn’t working or is going painfully slow.

I’ve put the steps I took here, in case they will be useful for others.

Step 1 – Getting wget

I already had wget on my Ubuntu desktop, but if you are on Windows you can google for “wget for windows”.
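
If wget isn’t already installed on an Ubuntu machine, it is in the standard repositories, so something like this should pull it in:

# install wget from the Ubuntu package repositories
sudo apt-get install wget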

Step 2 – Ignoring robots.txt

The robots.txt file for the site where our wiki is hosted is configured to prevent automated bots from leeching its content. And by default, wget respects the instructions in robots.txt.

To disable this, I had to create a .wgetrc file in my home directory and add “robots = off” to it.
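
So the whole file is just that one line (the line starting with # below is just a comment for my own benefit):

# ~/.wgetrc
robots = off

If you’d rather not create the file, the same setting can be passed on the command line with -e robots=off.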

Step 3 – Getting a session cookie

The wiki we use requires authentication. wget provides several approaches to this, from HTTP authentication to letting you spoof form variables in the commands it sends.

The approach I found to work is to use Firefox to access the site, logging on with my userid and password, and then letting wget use the cookie generated by my Firefox session.

If you click on “Show Cookies” in the Firefox Options, you can search for the wiki URL.

The useful bit was the Content value for the JSESSIONID.

The Cookies dialog also shows when this cookie expires.

In my case, it was only valid until the end of my browsing session, so the same value can’t be reused for a later run.
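
That Content value is what gets passed to wget as an extra HTTP header, along these lines (the JSESSIONID value is just a placeholder here; use whatever your own session shows):

--header "Cookie: JSESSIONID=<value copied from Firefox>"

The full command in the next step shows it in context.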

Step 4 – Run wget and let it go

wget --mirror --convert-links --html-extension --no-parent --wait=5 --header "Cookie: JSESSIONID=0000k8lkMXmvmF-75Pd8CuvTIBv:-1" https://mywiki.com/my/path/Home

--mirror turns on the default options for mirroring a site, such as enabling recursion. With this enabled, wget not only downloaded the ‘Home’ page I pointed it at, but also followed the links on that page, downloaded those pages, and so on.

--convert-links tells wget to go back over the downloaded pages, once everything has been fetched, and rewrite the links within them to point at my new local copy. Links to downloaded pages are converted from their original absolute URLs, which would have pointed back to the live online wiki, into relative links.

--html-extension was used because page URLs in our wiki don’t end in .html; they end with the page name. So a page about apples would have a URL of https://mywiki.com/my/path/Apples. With this option enabled, that page is still treated as an HTML page and the local copy is renamed to Apples.html accordingly.

--no-parent made sure that I only downloaded stuff from the particular wiki I was interested in, by preventing wget from ascending to any parent directory of the URL I gave it. By not including the --span-hosts option, I also made sure that no web links away from the wiki site were followed.

--header was where I provided the session id I obtained from the cookie created for Firefox.

--wait=5 makes wget wait for five seconds between each download. My downloading isn’t urgent, so doing it slowly spreads the load on the server out a bit.

And that’s it… it downloaded a copy of every page, complete with every image, and fixed all links and references to be relative paths to my local copy.
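
By default, wget writes the mirror into a directory named after the host, so (sticking with the example URL above) the local home page ends up at ./mywiki.com/my/path/Home.html and can be opened straight from disk:

# open the offline copy in a browser
firefox ./mywiki.com/my/path/Home.html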

4 Responses to “Making an offline copy of a wiki”

  1. […] to this post for the idea. […]

  2. Rich Cumbers says:

    The command is missing --page-requisites if you want to get a complete copy with every image and css artifact.

    Rich

  3. Chris says:

    --header was where I provided the session id I obtained from the cookie created for Firefox.

    What is the purpose of doing that?

  4. dale says:

    Chris – see step three above

    Step 3 – Getting a session cookie

    The wiki we use requires authentication. wget provides several approaches to this, from HTTP authentication to letting you spoof form variables in the commands it sends.

    The approach I found to work is to use Firefox to access the site, logging on with my userid and password, and then letting wget use the cookie generated by my Firefox session.