Importing Wikipedia February 6th 2007

Recent works

This work is done on 02/06/2007. For the recent updated work on newer Wikipedia databases are at EChronicle:Wikipedia database.

Have any questions?

Send me an email: phkim AT ece DOT gatech DOT edu

Download MediaWiki

  • First install Apache+PHP+MySQL. English users may visit EasyPHP. Korean users may use APMSETUP.
    • Note! MySQL 5.0.22 passed this test. MySQL 5.1 does not passed the test. It does not handle the Wikipedia text table properly.
  • Download Wikipedia engine from MediaWiki. At the time of experiment, I used MediaWiki 1.9.1 version.
  • Unpack under http document directory. This depends on your web server.
  • Do not run config/index.php at this time. You will configure Wikipedia at the end. If you run config, then it will put initial setup data into the database which will conflict with Wikipedia data that we will dump from the Wikipedia database.
  • Copy tables.sql under wikipedia/maintenance/tables.sql to your mysql/bin directory or add MySQL bin path to the current path setting.
  • Depending on the MediaWiki version, tables.sql may contain InnoDB definitions. In this case, you need to manually replace “Type=InnoDB” texts into “Type=MyISAM”. You may want to save the file with the new file name.
  • Run mysql client
  • Run following sql commands
    1. CREATE DATABASE enwiki_20070206; // you may change the database name as you wish
    2. USE enwiki_20070206;
    3. SOURCE tables.sql; // If you change the file name above, then use that.

Setup MySQL

Importing Wikipedia database requires the huge memory for fast insertion operation. Sample my.ini is below.

[mysqld]        basedir = "C:/APM_Setup/Server/MySQL5/"    datadir  = "C:/APM_Setup/Server/MySQLData"        skip-bdb    #skip-innodb    innodb_file_per_table        connect_timeout = 100    max_connections = 6000    wait_timeout = 60        key_buffer = 512M    max_allowed_packet = 16M    table_cache = 6000    sort_buffer_size = 512M    read_buffer_size = 64M    myisam_sort_buffer_size = 512M    thread_cache = 8    query_cache_size= 16M    bulk_insert_buffer_size=64M        thread_concurrency = 4   

Download Wikipedia database

Detail information to download Wikipedia database is available at Wikipedia database download. Use wget -c option to continue downloading even the connection is disconnected for a while. Download Windows wget at here

Convert the Wikipedia database into SQL dump

Use Xml2sql to convert XML dump into MySQL local dump file. After unpack the package, run xml2sql-fe.exe, set output format as mysqlimport, you will get page.txt, revision.txt, text.txt 3 files.

Process to import local data dump files into the database

MySQL4 backward-compatible mode

To enhance the speed to import very large data set, please read MySQL insert speed column. Myisamchk will only work for the MyISAM tables as its name represents.

  • Disable indexes: myisamchk –keys-used=0 -rq ..\data\enwiki-20061130\page
  • Change to MySQL shell
  • Use database: mysql> use enwiki-20061130;
  • Flush tables: mysql> FLUSH TABLES;
  • Change the MyISAM file size limit. Check to change the limitation. This problem happens when a single MyISAM data file size bigger than 4GB: mysql> ALTER TABLE page MAX_ROWS=1000000000;
  • Load local data file: mysql> LOAD DATA LOCAL INFILE ‘page.txt’ INTO TABLE page;
  • Reconstruct indexes: Read myisamchk and myisamchk memory usage to configure the parameters: myisamchk –fast –force –update-state –key_buffer_size=512M –sort_buffer_size=512M –read_buffer_size=64M –write_buffer_size=64M -rq ..\data\enwiki-20061130\page

Repeat the above steps equally to revision.txt and text.txt.

MySQL5 InnoDB mode

InnoDB mode is not recommeded to import the Wikipedia database, it may take over weeks using conventional PC.

  • To import bulky data into the InnoDB table space, it is recommended to disable keys during insertion. In the mysql client, ALTER TABLE page DISABLE KEYS;
  • To use the output of xml2sql-fe.exe, use mysqlimport MySQL client program under mysql/bin. ex) C:\Downloads\enwiki-20070206>mysqlimport -u root -p –local enwiki_20070206 page.txt // enwiki_20070206 is your database name. page.txt content will be imported into the same file name table, page.
  • Now enable keys: ALTER TABLE page ENABLE keys;

Repeat the above steps equally to revision.txt and text.txt.

Process to import other data SQL files

Wikipedia database is composed of many separated files. Main data files are in XML format and others are SQL source files having numerous INSERT comments in there. The problem is that these source codes sometimes include DROP commands or may use other ENGINE types like InnoDB that you may not want that to be. Then you need to filter out only INSERT codes from the source. This can be done easily by using the GREP utility. Windows version can be downloaded from Grep for Windows. Store following lines as a batch file (Ex. filter_sql.bat) and run it.

  • filter_sql.bat contents

grep INSERT enwiki-20070206-page_restrictions.sql > filtered\enwiki-20070206-page_restrictions.sql grep INSERT enwiki-20070206-user_groups.sql > filtered\enwiki-20070206-user_groups.sql grep INSERT enwiki-20070206-interwiki.sql > filtered\enwiki-20070206-interwiki.sql grep INSERT enwiki-20070206-langlinks.sql > filtered\enwiki-20070206-langlinks.sql grep INSERT enwiki-20070206-externallinks.sql > filtered\enwiki-20070206-externallinks.sql grep INSERT enwiki-20070206-templatelinks.sql > filtered\enwiki-20070206-templatelinks.sql grep INSERT enwiki-20070206-imagelinks.sql > filtered\enwiki-20070206-imagelinks.sql grep INSERT enwiki-20070206-categorylinks.sql > filtered\enwiki-20070206-categorylinks.sql grep INSERT enwiki-20070206-pagelinks.sql > filtered\enwiki-20070206-pagelinks.sql grep INSERT enwiki-20070206-oldimage.sql > filtered\enwiki-20070206-oldimage.sql grep INSERT enwiki-20070206-image.sql > filtered\enwiki-20070206-image.sql grep INSERT enwiki-20070206-site_stats.sql > filtered\enwiki-20070206-site_stats.sql

Procedure example for each sql file.

  • After import all data files. Save following lines as a SQL source file: TRUNCATE TABLE `categorylinks`;
  • LOCK TABLES `categorylinks` WRITE;
  • SOURCE enwiki-20061130-categorylinks-filtered.sql;

repeat above steps for all other filtered sql files. use follows batch processing.

ALTER TABLE categorylinks DISABLE KEYS; LOCK TABLES categorylinks WRITE; SOURCE enwiki-20070206-categorylinks.sql; UNLOCK TABLES; ALTER TABLE categorylinks ENABLE KEYS;  ALTER TABLE externallinks DISABLE KEYS; LOCK TABLES externallinks WRITE; SOURCE enwiki-20070206-externallinks.sql; UNLOCK TABLES; ALTER TABLE externallinks ENABLE KEYS;  ALTER TABLE image DISABLE KEYS; LOCK TABLES image WRITE; SOURCE enwiki-20070206-image.sql; UNLOCK TABLES; ALTER TABLE image ENABLE KEYS;  ALTER TABLE imagelinks DISABLE KEYS; LOCK TABLES imagelinks WRITE; SOURCE enwiki-20070206-imagelinks.sql; UNLOCK TABLES; ALTER TABLE imagelinks ENABLE KEYS;  ALTER TABLE interwiki DISABLE KEYS; LOCK TABLES interwiki WRITE; SOURCE enwiki-20070206-interwiki.sql; UNLOCK TABLES; ALTER TABLE interwiki ENABLE KEYS;  ALTER TABLE langlinks DISABLE KEYS; LOCK TABLES langlinks WRITE; SOURCE enwiki-20070206-langlinks.sql; UNLOCK TABLES; ALTER TABLE langlinks ENABLE KEYS;  ALTER TABLE oldimage DISABLE KEYS; LOCK TABLES oldimage WRITE; SOURCE enwiki-20070206-oldimage.sql; UNLOCK TABLES; ALTER TABLE oldimage ENABLE KEYS;  ALTER TABLE pagelinks DISABLE KEYS; LOCK TABLES pagelinks WRITE; SOURCE enwiki-20070206-pagelinks.sql; UNLOCK TABLES; ALTER TABLE pagelinks ENABLE KEYS;  ALTER TABLE site_stats DISABLE KEYS; LOCK TABLES site_stats WRITE; SOURCE enwiki-20070206-site_stats.sql; UNLOCK TABLES; ALTER TABLE site_stats ENABLE KEYS;  ALTER TABLE templatelinks DISABLE KEYS; LOCK TABLES templatelinks WRITE; SOURCE enwiki-20070206-templatelinks.sql; UNLOCK TABLES; ALTER TABLE templatelinks ENABLE KEYS;  ALTER TABLE user_groups DISABLE KEYS; LOCK TABLES user_groups WRITE; SOURCE enwiki-20070206-user_groups.sql; UNLOCK TABLES; ALTER TABLE user_groups ENABLE KEYS; 

Configure Wikipedia

Open the web browser and run config/index.php under Wikipedia directory

Query information

After successful DB building, you should be aware of SQL structures to query information which is available at Database Layout.

The current wikitext of a given page in the main namespace:

   INNER JOIN revision ON page_latest = rev_id
   INNER JOIN text ON rev_text_id = old_id
   page_title = 'Your page title'