mbox splitter

Submitted by msameer on Sun, 22/07/2007 - 10:56pm

I can't run sa-learn on my spam folder of 15,000+ messages: it crashes because of some hardware problems.
I thought that splitting the mailbox into smaller files would let me feed it to sa-learn.
Something similar may well exist already, but I wrote mine anyway ;-)

The only problem is that it consumes a lot of CPU and RAM. It was killed or crashed multiple times, but it worked and allowed me to feed my spam mailbox to SpamAssassin!

Here it is in case someone needs it: split_mailbox.py. It needs Python 2.5.
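
For reference, here is a minimal sketch of a low-memory way to do this kind of split. It is not the split_mailbox.py script itself, which is not reproduced here. It assumes Python 3 rather than 2.5, and it assumes the mailbox follows the usual mbox convention that every message starts with a line beginning with "From " and that such lines inside message bodies are escaped as ">From ". The function name `split_mbox` and its parameters are made up for illustration.

```python
import os


def split_mbox(src_path, out_dir, per_file=1000, prefix="part"):
    """Split an mbox file into chunks of at most per_file messages each.

    The file is read one line at a time, so memory use does not grow
    with the size of the mailbox. Returns the list of chunk paths.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    count = 0  # messages written to the current chunk so far
    with open(src_path, "rb") as src:
        for line in src:
            # Assumption: a line starting with "From " begins a new message.
            if line.startswith(b"From "):
                if out is None or count >= per_file:
                    if out is not None:
                        out.close()
                    name = "%s-%04d.mbox" % (prefix, len(paths))
                    path = os.path.join(out_dir, name)
                    out = open(path, "wb")
                    paths.append(path)
                    count = 0
                count += 1
            # Anything before the first "From " line is skipped.
            if out is not None:
                out.write(line)
    if out is not None:
        out.close()
    return paths
```

Each resulting chunk is itself an mbox file, so it could then be fed to SpamAssassin one at a time, for example with `sa-learn --spam --mbox part-0000.mbox`.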

Comments

Submitted by Amr Gharbeia (not verified) on Sun, 22/07/2007 - 11:23pm

Would Maildir help? I need to learn SA too.

Submitted by msameer on Sun, 22/07/2007 - 11:33pm

Help with what?

Submitted by Josh Triplett (not verified) on Mon, 23/07/2007 - 2:39am

A similar tool exists in Git: git-mailsplit.

Submitted by Delian Krustev (not verified) on Wed, 15/08/2007 - 2:01pm

Maildir is the format introduced by qmail for storing messages. It looks like this:

Maildir
|-cur
|-new
|-tmp

Here cur, new and tmp are directories. Each message is stored in a separate file under one of these directories (new for unread messages, cur for read ones, tmp for temporary files during delivery).

You could look for the mbox2maildir utility (www.qmail.org should help).
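
As an illustration of the layout described above, here is a rough sketch of an mbox-to-Maildir conversion using the `mailbox` module from the Python standard library. This is not the mbox2maildir utility. It assumes Python 3, and `mbox_to_maildir` is a made-up name. The module indexes message offsets in the mbox first but loads message bodies one at a time while copying.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir.

    Creates the Maildir (with its cur, new and tmp subdirectories)
    if it does not exist yet. Returns the number of messages copied.
    """
    src = mailbox.mbox(mbox_path, create=False)
    dest = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        for msg in src:
            # Converting to MaildirMessage carries over mbox status
            # flags, which decide whether the file lands in new or cur.
            dest.add(mailbox.MaildirMessage(msg))
            copied += 1
    finally:
        dest.close()
        src.close()
    return copied
```

After the conversion every message is a separate file, so a tool can process them one by one instead of loading a single huge mailbox.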

Submitted by Christoph (not verified) on Wed, 12/09/2007 - 10:08am

The memory consumption of this method is so high that larger mailboxes cannot be processed at all. I tried to split an mbox file with 125,000 messages with 1024 MB of RAM, but I always got a MemoryError.

Can anyone give a better solution to this?

Submitted by msameer on Wed, 12/09/2007 - 12:38pm

Didn't mbox2maildir work for you?
