mbox splitter
Submitted by msameer on Mon, 23/07/2007 - 12:56am
I wrote this because I can't run sa-learn on my spam folder of 15,000+ messages: it crashes because of some hardware problems.
I thought that splitting the mailbox into smaller files would let me feed it to sa-learn piece by piece.
Something similar may already exist, but I wrote my own anyway ;-)
The only problem is that it consumes a lot of CPU and RAM. It was killed or crashed several times, but in the end it worked and let me feed my spam mailbox to SpamAssassin!
Here it is in case someone needs it: split_mailbox.py. It needs Python 2.5.

Would Maildir help? I need to learn SA too.
Help with what?
A similar tool exists in Git: git-mailsplit.
Maildir is the message storage format introduced by qmail. It looks like this:
Maildir
|-cur
|-new
|-tmp
Here cur, new and tmp are directories. Each message is stored in a separate file under one of them (new for unread, cur for read, tmp for temporary files during delivery).
You could look for the mbox2maildir utility (www.qmail.org should help).
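If you would rather not hunt for a separate tool, the conversion described above can also be sketched with the mailbox module from the Python standard library. This is only a minimal sketch for Python 3, not the mbox2maildir utility itself; the function name mbox_to_maildir and both path arguments are made up for illustration.

```python
import mailbox


def mbox_to_maildir(mbox_path, maildir_path):
    """Copy every message from an mbox file into a Maildir.

    Hypothetical helper, not part of any existing tool.
    Returns the number of messages copied.
    """
    # create=False: fail instead of silently creating an empty mbox.
    src = mailbox.mbox(mbox_path, create=False)
    # create=True: make the cur/new/tmp directories if they are missing.
    dest = mailbox.Maildir(maildir_path, create=True)
    copied = 0
    try:
        # Messages are parsed one at a time as the loop advances.
        for message in src:
            dest.add(message)  # written as one file per message
            copied += 1
    finally:
        dest.close()
        src.close()
    return copied
```

Calling mbox_to_maildir("spam.mbox", "SpamMaildir") (both names are placeholders) should leave one file per message under the Maildir, which can then be processed in smaller batches.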
The memory consumption of this method is so high that larger mailboxes cannot be processed at all. I tried to split an mbox file with 125,000 messages using 1024 MB of RAM, but I always got a MemoryError.
Can anyone give a better solution to this?
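One way to keep memory use flat is to stream the mailbox line by line instead of loading it. What follows is a rough sketch for Python 3, not the original split_mailbox.py; the name split_mbox, the chunk file names and the default of 1000 messages per chunk are all made up for illustration. It assumes a message starts at a line beginning with "From " that follows a blank line (or the start of the file), so a mailbox whose bodies contain unescaped "From " lines may be split in the wrong place.

```python
import os


def split_mbox(src_path, out_dir, per_file=1000):
    """Split an mbox file into chunks of at most per_file messages.

    Hypothetical helper. Only one line is held in memory at a time,
    so memory use does not grow with the size of the mailbox.
    Returns the list of chunk paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    out = None
    count = 0          # messages written to the current chunk
    prev_blank = True  # the start of the file counts as a blank line
    try:
        with open(src_path, "rb") as src:
            for line in src:
                # Assumed message boundary: "From " right after a blank line.
                if prev_blank and line.startswith(b"From "):
                    if out is None or count >= per_file:
                        if out is not None:
                            out.close()
                        path = os.path.join(out_dir, "chunk_%05d.mbox" % len(paths))
                        paths.append(path)
                        out = open(path, "wb")
                        count = 0
                    count += 1
                # Anything before the first "From " line is skipped.
                if out is not None:
                    out.write(line)
                prev_blank = line in (b"\n", b"\r\n")
    finally:
        if out is not None:
            out.close()
    return paths
```

Each chunk is itself a valid mbox, so it can be fed to sa-learn on its own, for example with sa-learn --spam --mbox chunk_00000.mbox.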
Didn't mbox2maildir work for you?