Remove Wiki spam messages

What happened to my Wiki?

Well, today my web site service provider informed me that my database size is more than the quota (200MB). I could not believe at first since wiki is mostly composed of texts and I never wrote that much yet. When I checked the table size (use the sum of Data_length column of SHOW TABLE STATUS), wow it was 5 GB and all discussion pages at my Wiki got heavily infected with spam messages like selling drugs, Viagra, etc. you know.

How to remove spam messages from the Wiki database?

MediaWiki, which is the back-end software for Wiki (as used in Wikipedia), supports MySQL or PostgreSQL database. My web server uses MySQL 4.1 which is the most stable version. I know a little bit on its database layout but could not simply figure out the way to remove such spam messages. Google only returned useless pages out of my interests. So here I record what I did to remove spams in the Wiki Discussion (or called Talk) pages.

Limitations

Since there are various spamming methods to Wiki sites, let me confine the environment to:

  • Your Wiki engine is MediaWiki.
  • Infected pages are Wiki talk pages as most Wiki users allows the anonymous writing for talk (or discussion) pages.
  • You simply want to delete ALL discussion pages.

Step-by-step instructions

Your working environment should be some MySQL console or such Web-based database admin tool (phpMyAdmin for my case). Let me remind you first that MediaWiki table definitions are different by version. So any table definitions (DDL) in this page should be modified to comply with that of your MediaWiki table definitions.

Based on the MediaWiki database layout, the first table you will work on would be text table. And its table size might be the biggest one if your Wiki has been attacked by spammers.

Clean text table

If your text table is too large, then just removing records may fail because web service providers set the time limit on PHP code execution or MySQL code execution. So I could not delete discussion texts but had to go through like below. In below SQL statements, *w_* is the prefix of table names. So modify table names like yours. If you are interested in the role of each table, refer MediaWiki database tables.

# create the backup text table that will store not-infected pages
CREATE TABLE w_text_backup (
  `old_id` int(10) unsigned NOT NULL,
  `old_text` mediumblob NOT NULL,
  `old_flags` tinyblob NOT NULL,
  PRIMARY KEY (`old_id`)
) ENGINE=InnoDB;

# From existing text table, copy all page texts except discussion page.
# Namespace 1 means the discussion page.
INSERT INTO w_text_backup
SELECT
   w_text.*
FROM
   w_page
   INNER JOIN w_revision ON page_id = rev_page
   INNER JOIN w_text ON rev_text_id = old_id
WHERE
   page_namespace <> 1;

# Now replace text table with the new one
DROP TABLE w_text;
RENAME TABLE w_text_backup TO w_text;

BE AWARE!! Above statements will remove ALL discussion pages. If you have any discussion pages to save, then you should manually export that pages before doing this. For that, visit the Wiki special pages to export pages in XML. After you finish all jobs below, then import that XML page into your wiki back.

Clean other tables

Since we deleted many texts, related records at other tables should be removed either.

DELETE w_revision
FROM
   w_page
   INNER JOIN w_revision ON page_id = rev_page
WHERE
   page_namespace=1;

DELETE w_recentchanges
FROM
   w_recentchanges
WHERE
   rc_namespace = 1;

DELETE w_externallinks
FROM w_page
INNER JOIN w_externallinks ON el_from = page_id
WHERE page_namespace = 1;

DELETE w_page
FROM
   w_page
WHERE
   page_namespace=1;

DONE!

Now my wiki size is reduced back to tens of megabytes from 5GB.

Reference

  1. MediaWiki database layout: http://www.mediawiki.org/wiki/Manual:Database_layout
  2. MediaWiki database tables: http://www.mediawiki.org/wiki/Category:MediaWiki_database_tables