MediaWiki2XWiki Extension
This application imports the contents of a MediaWiki instance into an XWiki instance. It uses:
- A MediaWiki XML dump (for instance the Wikipedia one, downloaded from 1)
- Dom4J for parsing Wikipedia XML contents
- WikiModel for converting MediaWiki syntax to XWiki syntax.
- The Groovy script below
Groovy script
import org.dom4j.io.SAXReader
import org.dom4j.*
import groovy.net.xmlrpc.*
import java.net.ServerSocket
import org.wikimodel.wem.mediawiki.MediaWikiParser
import org.wikimodel.wem.xwiki.*

// Handles each <page> element as soon as it has been parsed, then detaches it
// so that the whole dump never has to be held in memory.
class PruningPageHandler implements ElementHandler {
    def proxy, token
    def counter = 0
    def max = 10000   // stop importing once this many pages have been read
    def messages = []

    PruningPageHandler(proxy, token) {
        this.proxy = proxy
        this.token = token
    }

    public void onStart(ElementPath path) { }

    public void onEnd(ElementPath path) {
        def page = path.current
        def title = page.elementText('title')
        title = title.replaceAll(' ', '_')
        def id = page.elementText('id')
        println(title + '(' + counter + ')')

        // Only one revision per page is read from the dump
        def revision = page.element('revision')
        def revid = revision.elementText('id')
        def revtext = revision.elementText('text')
        def contributor = revision.element('contributor')
        def username = contributor.elementText('username')

        // Redirect pages carry "redirect" within their first 30 characters
        def index = revtext.substring(0, Math.min(30, revtext.length())).toLowerCase().indexOf("redirect")
        counter++
        if (counter < max && index < 0) {
            // Clean up markup before conversion
            revtext = revtext.replaceFirst("^-", "*")
            revtext = revtext.replaceAll("__", "")
            revtext = revtext.replaceAll("[\\|][\\+]", "")

            // Fall back to the cleaned MediaWiki text if the conversion fails
            def buffer = new StringBuffer()
            buffer.append(revtext)
            try {
                def reader = new StringReader(revtext)
                def parser = new MediaWikiParser()
                buffer = new StringBuffer()
                def listener = new XWikiSerializer(buffer)
                parser.parse(reader, listener)
            } catch (Exception e) {
                println(e.getMessage())
            }

            // Store the page in the "Wikipedia" space through XML-RPC
            def map = new HashMap()
            map.put('content', buffer.toString())
            map.put('modifier', username)
            map.put('space', 'Wikipedia')
            map.put('title', title)
            try {
                proxy.confluence1.storePage(token, map)
            } catch (Exception e) {
                println(e.getMessage())
            }
        }
        page.detach() // prune the tree
    }
}

def server = new XMLRPCServer()
def proxy = new XMLRPCServerProxy("http://xwikiserver/xwiki/xmlrpc/confluence")
def token = proxy.confluence1.login("", "")

def reader = new SAXReader()
def handler = new PruningPageHandler(proxy, token)
File f = new File("/home/slauriere/enwiki-20070908-pages-articles.xml.bz2.1.out")
FileInputStream fis = new FileInputStream(f)
reader.addHandler('/mediawiki/page', handler)
reader.setEncoding('UTF-8')
reader.read(fis)
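The text clean-up steps inside onEnd can be tried out without a wiki server or a dump by pulling them into small functions. The following is a minimal sketch under that assumption: the names normalizeTitle, isRedirect and cleanMarkup are hypothetical and do not appear in the script, although each body mirrors a line from it.

```groovy
// Hypothetical helpers mirroring the pre-processing done in onEnd.

// Page titles are stored with underscores instead of spaces.
String normalizeTitle(String title) {
    return title.replaceAll(' ', '_')
}

// A page counts as a redirect when "redirect" appears (ignoring case)
// within the first 30 characters of its text.
boolean isRedirect(String text) {
    def head = text.substring(0, Math.min(30, text.length())).toLowerCase()
    return head.indexOf('redirect') >= 0
}

// Same clean-up as the script: turn a leading "-" into "*",
// remove double underscores and remove the "|+" sequence.
String cleanMarkup(String text) {
    def result = text.replaceFirst('^-', '*')
    result = result.replaceAll('__', '')
    result = result.replaceAll('[\\|][\\+]', '')
    return result
}

println normalizeTitle('Autism spectrum disorder')
println isRedirect('#REDIRECT [[Albedo]]')
println cleanMarkup('-first item')
```

Keeping these steps separate from the XML-RPC call makes it easier to reproduce a conversion problem on a single page before re-running the full import.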
Todos
- Fix pending WikiModel converter issues (tables, upper case, etc.)
- Work directly on a compressed file
- First letter of link should be upper case, see for instance "autism spectrum disorder" at http://en.wikipedia.org/w/index.php?title=Albedo&action=edit
- Issues on the following pages:
  - Albedo
  - Adobe
  - ...
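Two of the todos above lend themselves to short sketches. The code below is only an illustration under stated assumptions: capitalizeLinkTargets and openDump are hypothetical names, the link pattern covers only plain [[target]] and [[target|label]] links, and gzip (available in the JDK) stands in for bzip2, which would need a third-party decompressor.

```groovy
import java.util.zip.GZIPInputStream

// Hypothetical helper: upper-case the first letter of every link target,
// so that [[autism spectrum disorder]] points to "Autism spectrum disorder".
String capitalizeLinkTargets(String text) {
    // Group 1 is the link target: everything after "[[" up to "]" or "|"
    return text.replaceAll(/\[\[([^\]\|]+)/) { all, target ->
        '[[' + target.capitalize()
    }
}

// Hypothetical helper: open a dump for streaming, decompressing on the fly
// when the file name ends in ".gz". Assumption: gzip stands in for bzip2 here.
InputStream openDump(File f) {
    def raw = new BufferedInputStream(new FileInputStream(f))
    return f.name.endsWith('.gz') ? new GZIPInputStream(raw) : raw
}

println capitalizeLinkTargets('See [[autism spectrum disorder]] and [[albedo|reflectivity]].')
```

The stream returned by openDump could be passed to reader.read in place of the FileInputStream, and capitalizeLinkTargets could be applied to the page text before it is handed to the parser.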
Version 2.1 last modified by VincentMassol on 10/12/2007 at 18:56