This work was done on 02/06/2007. More recent work on newer Wikipedia databases is at EChronicle:Wikipedia database.
Send me an email: phkim AT ece DOT gatech DOT edu
Importing the Wikipedia database requires a large amount of memory for fast insert operations. A sample my.ini is below.
[mysqld]
basedir = "C:/APM_Setup/Server/MySQL5/"
datadir = "C:/APM_Setup/Server/MySQLData"
# Disable the BDB engine; uncomment skip-innodb to disable InnoDB as well.
skip-bdb
#skip-innodb
innodb_file_per_table
connect_timeout = 100
max_connections = 6000
wait_timeout = 60
# Large MyISAM key buffer speeds up index handling during bulk inserts.
key_buffer = 512M
max_allowed_packet = 16M
table_cache = 6000
sort_buffer_size = 512M
read_buffer_size = 64M
# Buffer used when rebuilding MyISAM indexes (ALTER TABLE ... ENABLE KEYS).
myisam_sort_buffer_size = 512M
thread_cache = 8
query_cache_size= 16M
# Per-thread cache for speeding up bulk inserts into MyISAM tables.
bulk_insert_buffer_size=64M
thread_concurrency = 4
Detailed information on downloading the Wikipedia database is available at Wikipedia database download. Use the wget -c option so the download resumes even if the connection drops for a while. A Windows build of wget can be downloaded here.
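For example, the 02/06/2007 English dump could be fetched as follows; the exact file name and URL are assumptions based on the usual enwiki-YYYYMMDD naming, so copy the real link from the download page:
wget -c http://download.wikimedia.org/enwiki/20070206/enwiki-20070206-pages-articles.xml.bz2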
Use Xml2sql to convert the XML dump into MySQL local dump files. After unpacking the package, run xml2sql-fe.exe and set the output format to mysqlimport; you will get three files: page.txt, revision.txt, and text.txt.
To improve the speed of importing a very large data set, please read the MySQL insert speed column. Note that myisamchk only works on MyISAM tables, as its name suggests.
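As a minimal sketch of the import step, page.txt can be loaded with mysqlimport, which maps the file name to the page table; the database name wikidb and the root account are placeholders for your own setup, and the MediaWiki tables must already exist:
mysqlimport --local --user=root --password wikidb page.txt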
Repeat the same steps for revision.txt and text.txt.
InnoDB mode is not recommended for importing the Wikipedia database; it may take weeks on a conventional PC.
The Wikipedia database is distributed as many separate files. The main data files are in XML format; the others are SQL source files containing numerous INSERT statements. The problem is that these SQL files sometimes include DROP commands or may specify another ENGINE type, such as InnoDB, that you may not want. In that case you need to filter out only the INSERT statements from the source. This can be done easily with the grep utility; a Windows version can be downloaded from Grep for Windows. Store the following lines as a batch file (e.g. filter_sql.bat) and run it.
grep INSERT enwiki-20070206-page_restrictions.sql > filtered\enwiki-20070206-page_restrictions.sql
grep INSERT enwiki-20070206-user_groups.sql > filtered\enwiki-20070206-user_groups.sql
grep INSERT enwiki-20070206-interwiki.sql > filtered\enwiki-20070206-interwiki.sql
grep INSERT enwiki-20070206-langlinks.sql > filtered\enwiki-20070206-langlinks.sql
grep INSERT enwiki-20070206-externallinks.sql > filtered\enwiki-20070206-externallinks.sql
grep INSERT enwiki-20070206-templatelinks.sql > filtered\enwiki-20070206-templatelinks.sql
grep INSERT enwiki-20070206-imagelinks.sql > filtered\enwiki-20070206-imagelinks.sql
grep INSERT enwiki-20070206-categorylinks.sql > filtered\enwiki-20070206-categorylinks.sql
grep INSERT enwiki-20070206-pagelinks.sql > filtered\enwiki-20070206-pagelinks.sql
grep INSERT enwiki-20070206-oldimage.sql > filtered\enwiki-20070206-oldimage.sql
grep INSERT enwiki-20070206-image.sql > filtered\enwiki-20070206-image.sql
grep INSERT enwiki-20070206-site_stats.sql > filtered\enwiki-20070206-site_stats.sql
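The redirections above write into a filtered subdirectory, so create it before running the batch file. A minimal usage example, assuming the dump files and filter_sql.bat are in the current directory:
mkdir filtered
filter_sql.bat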
The procedure is the same for each filtered SQL file: disable keys, lock the table, source the file, unlock the table, and re-enable keys. Run the following statements in the mysql client for the filtered files (add a matching block for any filtered file not listed, such as page_restrictions).
ALTER TABLE categorylinks DISABLE KEYS; LOCK TABLES categorylinks WRITE; SOURCE enwiki-20070206-categorylinks.sql; UNLOCK TABLES; ALTER TABLE categorylinks ENABLE KEYS;
ALTER TABLE externallinks DISABLE KEYS; LOCK TABLES externallinks WRITE; SOURCE enwiki-20070206-externallinks.sql; UNLOCK TABLES; ALTER TABLE externallinks ENABLE KEYS;
ALTER TABLE image DISABLE KEYS; LOCK TABLES image WRITE; SOURCE enwiki-20070206-image.sql; UNLOCK TABLES; ALTER TABLE image ENABLE KEYS;
ALTER TABLE imagelinks DISABLE KEYS; LOCK TABLES imagelinks WRITE; SOURCE enwiki-20070206-imagelinks.sql; UNLOCK TABLES; ALTER TABLE imagelinks ENABLE KEYS;
ALTER TABLE interwiki DISABLE KEYS; LOCK TABLES interwiki WRITE; SOURCE enwiki-20070206-interwiki.sql; UNLOCK TABLES; ALTER TABLE interwiki ENABLE KEYS;
ALTER TABLE langlinks DISABLE KEYS; LOCK TABLES langlinks WRITE; SOURCE enwiki-20070206-langlinks.sql; UNLOCK TABLES; ALTER TABLE langlinks ENABLE KEYS;
ALTER TABLE oldimage DISABLE KEYS; LOCK TABLES oldimage WRITE; SOURCE enwiki-20070206-oldimage.sql; UNLOCK TABLES; ALTER TABLE oldimage ENABLE KEYS;
ALTER TABLE pagelinks DISABLE KEYS; LOCK TABLES pagelinks WRITE; SOURCE enwiki-20070206-pagelinks.sql; UNLOCK TABLES; ALTER TABLE pagelinks ENABLE KEYS;
ALTER TABLE site_stats DISABLE KEYS; LOCK TABLES site_stats WRITE; SOURCE enwiki-20070206-site_stats.sql; UNLOCK TABLES; ALTER TABLE site_stats ENABLE KEYS;
ALTER TABLE templatelinks DISABLE KEYS; LOCK TABLES templatelinks WRITE; SOURCE enwiki-20070206-templatelinks.sql; UNLOCK TABLES; ALTER TABLE templatelinks ENABLE KEYS;
ALTER TABLE user_groups DISABLE KEYS; LOCK TABLES user_groups WRITE; SOURCE enwiki-20070206-user_groups.sql; UNLOCK TABLES; ALTER TABLE user_groups ENABLE KEYS;
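These statements are meant to be run from an interactive mysql session started in the directory holding the filtered SQL files, so that the SOURCE commands can find them. A sketch, assuming the database is named wikidb and the root account is used:
cd filtered
mysql --user=root --password wikidb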
Open a web browser and run config/index.php under the MediaWiki directory.
After the database has been built successfully, you need to understand the SQL schema in order to query it; the schema is documented at Database Layout.
The current wikitext of a given page in the main namespace:
SELECT old_text
FROM page
  INNER JOIN revision ON page_latest = rev_id
  INNER JOIN text ON rev_text_id = old_id
WHERE page_title = 'Your page title'
  AND page_namespace = 0;
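Note that page_title values are stored with underscores instead of spaces. A usage sketch from the command line, assuming the database is named wikidb and that Main_Page exists in the imported dump:
mysql --user=root --password wikidb -e "SELECT old_text FROM page INNER JOIN revision ON page_latest = rev_id INNER JOIN text ON rev_text_id = old_id WHERE page_title = 'Main_Page' AND page_namespace = 0;"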