

Cumbersome migration of my old blog

Sometimes things get complicated. This is the story of migrating my Finnish blog from Saunablog to this site.

I have spent the last three days migrating the 52 posts of my old blog from Saunablog to Codegrove. I don't want to be dependent on my mobile operator, so I decided to move the blog to a site I host together with my friends.

Saunablog is hosted by my mobile operator and is running on Nucleus CMS. Codegrove, on the other hand, is running on Plone 4.

Migrating the postings shouldn't be too hard: just copy and paste 52 articles. Well, that isn't the point. I could copy-paste endlessly and it would give me nothing. Doing a real migration is the art I want to learn better.

Please note that the migration scripts I provide are useful only for me. If you need to do something similar, adapt them to your own needs.

Looting of old data

The first bump was that I have no access to the database of Saunablog. The site is hosted somewhere, and all I've got are ordinary user privileges that allow just posting and editing. I call it looting because I don't really own the servers, and the same techniques can be used to get at something you don't even own.

I started with wget. I logged in to my Saunablog in Firefox to get the session cookie. Then I used a cookie exporting script to get cookies.txt, which wget can use.
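Downloading the administrative page itself then goes roughly like this. The URL here is made up; use whatever your own Nucleus admin page happens to be:

$ wget --load-cookies cookies.txt -O index.php.html 'http://blog.example.com/nucleus/index.php?action=itemlist'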

I wrote hallintasivu.xsl and download-uri.xsl for that purpose. The first one takes the blog's administrative interface page as input; the second produces one wget line per posting, which can be redirected into a script file and run as a second pass. In a fresh directory, run:

$ xsltproc --encoding 'ISO 8859-1' hallintasivu.xsl index.php.html >index.xml
$ xsltproc download-uri.xsl index.xml >tmp-download.sh
$ sh tmp-download.sh

Now you have all the raw messages. But that's only the first step.

About importing to Plone

Plone has support for WebDAV, which should make it much simpler to transfer data to and from a Plone instance. You don't need to hack Plone, just mount the site into your directory tree. Well, in theory, yes.

Importing data to Plone is like throwing a loaded die: you get the desired result most of the time, but there is always a chance of failure. Plone's WebDAV support is barely documented, so it was much easier to treat Plone as a black box and imagine a way to get a rabbit out of the box. After two days of trial and error I managed to do something I was happy with.

So, we need loads of configuration and a mysterious XSLT script to do the trick. Details follow.

Enabling WebDAV on Plone

This is quite straightforward. Assume you have Plone in a directory with a buildout configuration. Let's follow the instructions Epeli found and add the following to the [instance] section of buildout.cfg:

zope-conf-additional =
                       enable-ms-author-via on
                       <webdav-source-server>
                       address localhost:1337
                       force-connection-close off
                       </webdav-source-server>

After editing the file, you need to re-run buildout, of course.

$ sudo bin/buildout
$ sudo bin/instance restart

After that, you should have WebDAV running on port 1337. Quite elite, huh?
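A quick way to check that WebDAV is really answering is to send an OPTIONS request to it. The user name is just an example here; use one that exists on your instance. The response should advertise the WebDAV methods in its Allow and DAV headers:

$ curl -i -X OPTIONS -u admin http://localhost:1337/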

Mounting Plone instance

You can install WebDAV support for your Debian or Ubuntu box straight from the package manager. First of all, get davfs2.

$ apt-get install davfs2

You can mount the site as an ordinary user but for me it's much better to do it system-wide. I added the following to my local /etc/fstab:

https://my.plone.site/ /mnt/codegrove davfs uid=joell,noauto 0 0
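Because of noauto, the share is mounted on demand. Something like this should do, assuming the credentials are either typed in when mount.davfs asks for them or stored in /etc/davfs2/secrets:

$ sudo mount /mnt/codegrove
$ ls /mnt/codegrove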

Getting that done was easier than I thought.

Converting data to Plone format

This is the trickiest part of the work. The results are based on trial and error, as I mentioned earlier, so I don't have any sources of information to cite.

Getting the blog postings to preserve their timestamps was trickier than I thought. It's difficult to "forge" the modification or creation date, but setting the effective date (aka publishing date) is a feasible solution. Also, I wanted to enable comments and hide the blog postings from the navigation.

The following headers were optimal for me. The example is taken from one of my postings:

title: Warshavjanka 2.0
description: Kirjoitettu 07.06.2008 klo 15.24
effectiveDate: 2008-06-07 15:24
subject: saunablogi
  finnish
  ajatukset
allowDiscussion: True
language: fi
excludeFromNav: True
Content-Type: text/html

I also added the publishing date to the description field because that makes it easy to show dates in listings. The subject field holds the tags; additional tags are listed on new, indented lines.
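As far as I have understood the format Plone accepts over WebDAV, the whole file is simply those headers, one empty line and then the XHTML body. The body below is made up and the headers are shortened:

title: Warshavjanka 2.0
effectiveDate: 2008-06-07 15:24
Content-Type: text/html

<p>The body of the posting goes here as XHTML.</p>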

I wrote a script called plonefy.xsl to do the dirty part of the transformation. I ran it like this:

$ for JOO in saunablog/msg-*; do xsltproc --encoding 'ISO 8859-1' plonefy.xsl $JOO >plone_raw/$(echo $JOO|sed 's/saunablog\/msg-\(.*\)\.html/sb-\1/');done
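For example, a raw message named saunablog/msg-0042.html (the numbering is made up) would end up as plone_raw/sb-0042.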

Migration of in-line pictures

After some guru meditation I grasped how to get all the pictures related to my blog. I used the following incantation:

$ cat msg-*|grep -o '<%[^%]*%>'|sed 's/<%image(\([^|]*\)|.*/http:\/\/path.to\/my\/blog\/\1/' > pics.txt
$ mkdir pics
$ cd pics
$ wget -i ../pics.txt

As you can see, sed scripts are write-only.
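In plain words: if a posting contains a (made-up) tag like <%image(sauna.jpg|640|480|Sauna by the lake)%>, the pipe above writes http://path.to/my/blog/sauna.jpg into pics.txt, and wget then fetches it into the pics directory.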

Nucleus has its own inline image tags in the form of url, width, height and title, separated by |'s. To convert those tags to ordinary XHTML image tags, the following spell can be cast:

$ for MSG in *; do sed 's/<%image(\([^|]*\)|\([^|]*\)|\([^|]*\)|\([^)]*\))%>/<img src="pics\/\1" alt="\4" title="\4" width="\2" height="\3" \/>/g' <$MSG >../plone_final/$MSG; done
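With the same made-up tag as above, the conversion looks like this:

<%image(sauna.jpg|640|480|Sauna by the lake)%>

becomes

<img src="pics/sauna.jpg" alt="Sauna by the lake" title="Sauna by the lake" width="640" height="480" />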

Sending the content to Plone

Now we have the final documents ready for uploading to the Plone instance. I just copied the files in plone_final to the blog folder of my Plone instance using cp.
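In other words, something along these lines; the blog folder under the mount point is just an example, use whatever folder your blog lives in:

$ cp plone_final/* /mnt/codegrove/blog/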

Remember to publish the postings to make them visible.

It has something to do with cp or WebDAV, but the order of the postings seemed to be quite random. I worked around it by creating a collection and setting it to sort by effective date in descending order. Then I limited the collection to the current location (..) to hide all the other files on the site from the collection view.

To make it perfect, I set the collection to show 3 postings per page (from Edit...Number of Items) and to show the postings on the collection page (from Display...All Content).

Now we are getting somewhere. All the old posts have been imported and it's time to start writing new ones!

PS. Thanks to everybody at #codegrove for their patience and help when I was getting mad with this migration.