#
#   CRM114_Mailfilter_HOWTO.txt - The CRM114 & Mailfilter HOWTO
#
# Copyright 2003-2009 William S. Yerazunis.
# This file is under GPLv3, as described in COPYING.
#

              The CRM114 & Mailfilter HOWTO

      -Bill Yerazunis, 2003-09-18 (last update 2009-03-02)

This is the CRM114 Mailfilter HOWTO. It describes how to set up
CRM114 and Mailfilter to filter your incoming mail, as of the version
CRM114-20060209-ReaverSecondBreakfast.

This HOWTO doesn't describe _how_ CRM114, Mailfilter, Mailtrainer, or
Mailreaver work. It will just set you up enough so that you can start
using CRM114 and Mailfilter to filter your mail. It assumes you are
running on a Linux box; getting the system running on *BSD, MacOS, or
Windows will require considerably more work than we describe here (and
is a subject for future HOWTOs).

------------------------------------------------------

Remember, the CRM114 package is released under the GPL (license is
enclosed in any of the downloads). There is NO WARRANTY WHATSOEVER
for this software to be useful in any way; it's going to tamper with
your incoming mail and you can easily imagine the dangers in that.

----------------------------------------------------------

That said, I hope CRM114, Mailfilter, and Mailreaver are useful to
you; they've been very useful to me. CRM114 has been keeping my
mailbox clear of clutter since 2002; I'm convinced it has better
performance than I-the-human at killing spam without accidentally
deleting important mail. I've tested myself, and I-the-human is only
about 99.7% or 99.8% accurate at best; CRM114 is considerably more
accurate than that - easily two or three times more accurate. (As of
December 2003, it was 99.95% accurate (N+1 statistics) on my incoming
mail stream to a non-business account.)

Something to Remember: CRM114 is a *language* designed to write text
filters and classifiers in. It makes it easy to tweak code.
Mailfilter is just _one_ of the possible filters; there are many more
out there and if Mailfilter doesn't do what you want, it's easy to
create one that does. Mailreaver is another one of the filters, with
different (and better, I hope) designs, that can use Mailtrainer (yet
another filter) to build even better statistics files.

There are yet other filters written in CRM114; you can read all about
them on the web page:

    crm114.sourceforge.net

(and if you create one, and want to share it, put it on a web page and
send me an email so I can add a pointer.)

      - Bill Yerazunis (wsy@merl.com)

-------------------------------------------------------------------

Step 0: Scientes Inamicae (Know Thy Enemy)

These are the major steps in using CRM114 Mailfilter. The steps are
pretty simple:

 1) Downloading what you need
      (it's just 1 or 2 megabytes in a single .gz file)

 2) Setting up the executables
      (not more than ten commands to type, even if you're building
      from the fresh source)

 3) Configuring Mailfilter or Mailreaver
      (editing one file, most likely change is ONE line, and we tell
      you which one)

 4) Setting up the needed auxiliary files
      (not more than 2 files to edit of no more than 5 lines each,
      plus typing one or two commands)

 5) Engaging Mailfilter
      (if you are using Procmail, this is cut-and-paste about ten
      lines, otherwise it's create one file containing one line, and
      typing up to three commands)

 6) Training CRM114 and Mailfilter
      (whenever you get an error, you send it back to yourself, using
      your current mail tool. How hard can that be? Now, you can
      also use mailtrainer to bulk-train in whole directories of your
      old spam and good email.)

 7) Adding Priority Lists, Whitelists, and Blacklists
      Mailfilter supports whitelists, blacklists, term rewriting, and
      some other features. You can use these for "guaranteed
      delivery" from people you really trust - or really hate.

 8) Useful Utilities
      Details on the cssutil, cssdiff, and cssmerge utilities.
      You don't need to know this, but you may find it useful.

-------------------------------------------------------------------------

Step 1: Downloading.

Get yourself a copy of a CRM114 kit. The kits can always be found by
visiting the CRM114 homepage at:

    http://crm114.sourceforge.net

You will need at least the statically-linked binary kit (compiled to
run on any i386 or better Linux box); for best performance it is
suggested you get the source kit and compile it on the processor you
will be running CRM114 on. If you do not have root privs on the box
you will be running CRM114 on, it is suggested you stay with the
statically linked binaries (this is because the recommended "TRE"
REGEX library requires either root to install, or major workaround
mojo).

The kits are named:

    crm114-.i386.tar.gz    (statically linked binaries)

and

    crm114-.src.tar.gz     (complete source code + tests)

These kit .gz files are fairly small; usually less than one megabyte
(currently around 800 Kbytes) so they will download quickly.

You will need to decide if you will be starting off with a pre-learned
set of .css files (.css means CRM114 Sparse Spectra) or if you will be
creating your own .css files from your own samples of spam and
nonspam. You can think of a .css file as being a "cerebral memory" of
what a particular kind of mail (good or spam) looks like; .css files
are how CRM114 remembers what spam and good mail look like. With
empty .css files a CRM114 system acts like a total amnesiac - it has
absolutely no conception of "good" or "bad".

In general, the pre-learned .css files will give you an initially more
accurate filter, but after some use and training the self-created
filter files will catch up with pre-learned files, and then the two
filters will achieve about equal accuracy. However, there may be some
"glitches" in the mid-term while some edge cases in the prelearned
files are _unlearned_.
If you decide to start from pre-learned files rather than building
your own (we recommend building your own; see Step 4), you will also
need to download a set of pretrained .css files like these:

    crm114-.css.tar.gz

The .css files are rather large; this download may approach 50
megabytes (currently it's 8+ megabytes).

Download the kits you will need (at least one of .src.tar.gz or
.i386.tar.gz or .i386.rpm) and then proceed to "Step 2: Setting Up the
Executables".

--------------------------------------------------------------------------

Step 2: Setting Up the Executables

In this step, you will install four binaries into your system. The
four binaries are:

    crm      - the CRM114 "compute engine". It's called "crm"
               because "crm114" is too hard to type.
    cssutil  - the .css file check/verify/edit program
    cssdiff  - the .css file diff program
    cssmerge - the .css file merging program

One important point: do NOT install CRM114 or any of its utilities
setuid or sgid to root. If you do, that's just an invitation for
someone to utterly hose your system without even trying. We're not
talking an intentional attack, just an inadvertent command or script
gone weird could do it. This is also why we recommend using a
_static_ linking of the executable, so that a LD_LIBRARY_PATH attack
can't falsely insert a subversive version of a library.

-----

There are three ways you can set up these executables. You can:

    a) install with a .rpm kit
    b) install with a .i386.tar.gz (tarball of statically linked
       binaries)
    c) install with a .src.tar.gz (tarball of complete source)

Note 1: If you do not have root on the machine you are installing on,
you may have some problems during the installation. You may want to
consider using the statically linked binaries instead of compiling
from sources.

-----

Step 2 Method A: Installing from .i386.tar.gz

First, untar the binary release. Type:

    tar -zxvf crm114-.i386.tar.gz

You should now become root.
If you do not have root on your machine, you _can_ execute CRM114
programs directly from your home directory, by changing your $PATH
appropriately; see your shell man page for how to do this for your
particular shell (it varies with the shell, so I can't tell you here
how to do it) and skip to the end of this step. Or, you can run the
binary explicitly from your current directory by invoking it as
./crm114_tre.

If you're installing, become root, then type:

    cd crm114-
    make install_binary_only

This will install the pre-built binaries of CRM114 and the utilities
into /usr/bin. This is the default install location for CRM114. If
you want them installed in a different place, edit the Makefile and
change BINDIR (near the top of the Makefile) to a different directory.
Note that if you type "make clean" you'll _delete_ your prebuilt
binaries, so don't do that!

Now, you can test your work. Type

    crm -v

which will cause CRM114 to print out the version of itself you just
installed.

You can also run a quick "Hello, world!" by typing:

    crm '-{ output /Hello, world! This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix). You'll get back a response like:

    Hello, world! This is CRM114 version 20040118-BlameEric .

Congratulations! You've now completed the installation of CRM114 and
utilities from prebuilt binaries. Proceed to "Step 3: Configuring
Mailfilter or Mailreaver".

-----

Step 2 Method B: Compiling from .src.tar.gz (source)

This method is the most complex.

Start by uncompressing and untarring the big .src.tar.gz with the
command:

    tar -zxvf crm114-.src.tar.gz

Now cd down into the crm114- directory. You will see many files here.

You now have a choice: you can build CRM114 with either the GNU regex
libraries (not recommended, as GNU regex can't handle embedded NULL
bytes and has other issues), or with the TRE regex library
(recommended; this is what you get with the precompiled binary kit).
By default, you will use the TRE regex library; however, this means
you have to build and install TRE. You can either grab the most
recent version from the TRE homepage at http://laurikari.net/tre, OR
you can use the version that is pre-packaged with your CRM114
download. (The pre-packaged version is tested against CRM114 and will
have all appropriate patches installed, while the fresh one may have
new features. Take your choice - it's good stuff either way.)

Fortunately, building and installing TRE is easy. The TRE regex
library will peacefully coexist on the same system as the GNU regex
library.

Caution: if you are building from sources, you should install the TRE
regex library ***first***. TRE is the recommended regex library for
CRM114 (fewer bugs and more features than GNU regex).

To install TRE, become root, then type this ( BIG BIG WARNING - DO NOT
FORGET to tell configure to "--enable-static" ) :

    cd crm114-
    cd tre-
    ./configure --enable-static
    make
    make install

You have now installed the TRE regex library as /usr/local/lib/libtre .

If you make a mistake and need to rerun the make commands, be aware
that in some versions of TRE, a 'make clean' command will delete test
files that are needed when running the build process again.
Unfortunately, the safest course of action is to delete the TRE source
directory and restore it from the tar ball.

Depending on your choices in static versus dynamic linking, you _may_
need to also add /usr/local/lib to /etc/ld.so.conf, and then run
ldconfig as root. Or not. If, during the next steps, you get
annoying messages on the order of "can't find ltre" then this is the
thing to try.

Once TRE is built and installed you can compile CRM114 and the
accompanying utilities (cssutil, cssdiff, and cssmerge). By default,
CRM114 installs into /usr/bin (_not_ /usr/local/bin - if you want to
change this, change the definition of BINDIR near the top of the file
"Makefile").
Cssutil gives you some insight into the state of a .css file, cssdiff
lets you check the differences between two .css files, and cssmerge
lets you merge two .css files.

Change directory back up to the CRM114 directory, then become root,
then (noting that no ./configure step is necessary; the CRM114
Makefile is self-contained and presupplied) type:

    cd ..
    make

This will compile and link the CRM114 executable and the utilities.

You can test this executable if you want. Just type:

    make megatest

which will run for about a minute and exercise most of the code paths
inside CRM114. This tests the version of CRM114 in your local
directory. Note that this only works if you've installed the TRE
engine. The GNU regex engine has enough "fascinating behaviors" that
it will get a lot of things wrong; the GNU regex package also doesn't
handle approximate regexes at all, and since those are in the test
set, you'll error out on each of those as well.

If "megatest" reports any differences between the supplied
"megatest_knowngood.log" and your own results, OTHER than on lines
that say "OK_IF_blahblahblah", please file a bug report to me and
we'll figure out what went wrong.

If you are happy with the executable, type

    make install

This will install the executable into /usr/bin/crm (by default). If
you want another install location, you can change it in the Makefile.

You can now check the installed version by typing:

    crm -v

and CRM114 will report back the version of the install.

You can also run a quick "Hello, world!" by typing:

    crm '-{ output /Hello, world! This is CRM114 version :*:_crm_version: .\n/}'

then hit ^D (end-of-file on *nix). You'll get back a response like:

    Hello, world! This is CRM114 version 20040118-BlameEric .

Congratulations! You've now completed the installation of CRM114 and
utilities from source. Move on to the next step - "Step 3:
Configuring Mailfilter or Mailreaver".
------------------------------------------------------------------------
------------------------------------------------------------------------

Step 3: Configuring Mailfilter or Mailreaver

In this step you will tell Mailfilter or Mailreaver what you want it
to do with your mail. All of the options are controlled by editing
one file, named "mailfilter.cf" . Mailfilter and Mailreaver use most
of the same flags (and all of the same important ones) so both use the
same mailfilter.cf file.

By default, both Mailfilter and Mailreaver look for the file
mailfilter.cf in the initial directory. If you want to change that,
use "--fileprefix=/some/where/else/" on the command line, so these
filters will look for mailfilter.cf (and the other runtime filtering
files!) in the "/some/where/else/" directory. This --fileprefix mode
is handy when you are setting up many users. (Remember to use a final
closing slash on the directory name or you will end up nowhere.)

The format of mailfilter.cf itself is pretty simple.

    0) blank lines are OK.
    1) comments start with a # in column 1.
    2) Anything not a comment is a var setting, in the format:

           :var_to_set: /Value_to_set_goes_here/

All of the user-settable configuration vars have setup lines in
mailfilter.cf, and you only need to change three lines for a "default"
average setup: one is a password you make up, and the other two have
only a few possibilities each, and we list those possibilities for
you.

The Three Things you MUST do in mailfilter.cf :

 1: First, you MUST change the secret password. This is defined near
    the top of the file. Your password may contain a-z, A-Z, 0-9, but
    no blanks or punctuation (at least for now). You _must_ set this
    password to something not easily guessable. If you don't set it,
    you won't be able to use mailfilter's remote commanding facility.

 2: Second, you MUST set whether to use base64 decodes, or not, and
    if so, which decoder your system supports.
    Just type the options into BASH, one at a time (like "mewdecode ")
    and use the first one that doesn't give you an error message.

 3: Third, you MUST set the cache_dupe_command according to whether
    your system supports linking (as in, has an "ln" command; *NIX
    does, but Windows doesn't) or whether full copies of texts need to
    be used in the reaver.

Other than that, everything else in the mailfilter.cf file can be left
alone, at least for initial testing.

At first, you will probably want to leave "log_to_allmail.txt" enabled
while you get used to CRM114. Likewise, leave "log_rejections" set to
yes as well; that way you can easily see (with "less" or "tail") just
what is being rejected. Once you get more experience with CRM114, you
can set these to "no" and not use up disk space in these "extra
safety" logs.

You can skim-read the rest of mailfilter.cf . There are three typical
cases for most users:

 1) If you ARE using Procmail or another filtering MDA:

    --> You probably will NOT need to change any of the other options.

 2) If you ARE NOT using Procmail, but your mail reading program can
    sort out email into folders based on whether the SUBJECT header
    contains the telltale string "ADV:" (most mail readers can do
    this):

    --> You probably will NOT need to change any of the other options.

 3) You are NOT using Procmail, and your mail reading program is
    "dumb" (cannot sort email into folders based on subject line):

    --> You probably will want to define a separate account that will
        receive all spam caught (otherwise, you'll just get all your
        spam delivered as usual, with additional headers telling you
        it was spam). To do this, look down to ":general_fails_to:".
        Insert the full username@domainname.tld mail address where you
        want your spam to be sent.

Note on mime decoders: There are a number of them available; the
defaults given in mailfilter.cf may or may not be valid on your
system. Further, your decoder may have a different path than the
default given in mailfilter.cf.
Yet further, you may want to load your own, like "normalizemime" (see
the crm114.sourceforge.net web page for details on the download).

You can also configure the verboseness (or not) of your filtered
results. You can go from "no changes" (not even a statistical label
in the headers) to complete results including an expansion of any
base64 texts and HTML decommented strings.

Feel free to change things to get the look and feel you want; after
all, what good is open source if you don't change it? :-)

HOWEVER, please don't muck with variables that aren't in the
mailfilter.cf file. "You make a mess, you clean it up." :-(

After making these changes, write out "mailfilter.cf". You may later
go back and change the configuration options, but the options as
already set are good for most users. You do not need to do anything
to "load in" the new options, as CRM114 reads them in fresh from the
file during initialization for each email.

Now, proceed to "Step 4: Setting Up Other Needed Files" .

--------------------------------------------------------------------
--------------------------------------------------------------------

Step 4: Setting Up Other Needed Files

Now that the crm114 language is working, you need to set up your .css
files, your rewrites.mfp file, and your priolist.mfp file.

All of these files need to exist (either by actually being there, or
by being symlinked there) in the directory where CRM114 will "run in"
when an actual mail comes in. Usually this is your per-user directory
on the mail server (if your mail server is also the machine you log in
to, it's your home directory there). If this is inconvenient, you can
use the --fileprefix option on the command line to tell CRM114 to
"change over" to a different directory.
The files that need to be in the home (or --fileprefix) directory are:

    rewrites.mfp
    spam.css
    nonspam.css
    priolist.mfp
    blacklist.mfp   [ only for mailfilter; mailreaver ignores it ]
    whitelist.mfp   [ only for mailfilter; mailreaver ignores it ]

Here's a quick overview of these files; we'll get into the details
further on. If you are in a hurry, you can have *empty* files for all
four of the .mfp files and things will still work reasonably well (and
you can upgrade later). You DO need to create the proper-sized .css
files, though, or you won't be able to classify email at all
(depending on your setup, it may be discarded, may be returned to
sender, or may actually just get mangled and forwarded. None of these
are a good thing, in the long run.)

--- Summary of each file ---

[[ rewrites.mfp ]]

The rewrites.mfp file controls how to "rewrite" incoming email so that
your incoming email conforms more closely to what might be considered
"archetypal". The rewrites.mfp setup is optional; if you build your
own .css files (either from empty files, or from corpora) you can
actually replace rewrites.mfp with an empty file; you just won't be
able to share your .css files with anyone else.

[[ spam.css and nonspam.css ]]

The .css files themselves ( CRM114 Sparse Spectra files) are the
"memory" that crm114 uses to statistically describe the words and
phrases that characterize various kinds of mail. Although it depends
on the classifier you are using, by default the .css files are in a
hashed binary format and it is not easy (or sometimes, even possible!)
to reconstruct your email from the .css files. However, it *is*
possible to determine from the .css files if certain words or phrases
have ever been trained into your classifier, so .css files do have
some possible security implications.
DMCA note: CRM114 statistics files are "effectively encrypted"
according to the provisions of the DMCA - all parties are hereby
notified that the copyright owner/author of any particular statistics
file (.css , .cfc, .cor, .cwc, .chp, or other) is the creator of that
file, not the author(s) of CRM114 itself, and said creator may invoke
the draconian punishments of the DMCA on any party attempting to
extract the encoded information without prior approval. So there.

[[ priolist.mfp ]]

The priolist.mfp file is a sequential list of tests to be run; each
test starts with a + or a - (thumbs up or thumbs down), then a regex
pattern; if the pattern matches, the mail is either accepted
unconditionally or sent to the spam bucket unconditionally.

Then, blacklist.mfp and whitelist.mfp are "match this, you're spam"
and "match this, you're good" regex pattern sets. If this seems
redundant, you're right; all you need is priolist.mfp, but enough
folks have historically requested "blacklists" and "whitelists" as an
explicit marketing checkoff that we've put them into mailfilter.crm.
Priolist.mfp is the preferred method of doing blacklists and
whitelists now; if a P.H.B. asks "does it have blacklists and
whitelists", you can now say "yes, and they're even _prioritized_
blacklists and whitelists!".

Step 4 Part 1 - Setting up the Rewrites file.

To set up the rewrites.mfp file, edit the file "rewrites.mfp" and
replace the placeholders (in this case, "wsy", "merl.com", and
"mail.merl.com") with your corresponding username, domain name, and
mail server information. These rewrite rules will be used to "scrub"
your sample text of user-specific strings. (Note that this is only
strictly necessary if you want to use the pre-built .css files.
However, it is in general recommended, so that you can "share/merge"
your .css files with your friends.)

Note the "arrowheads" in the file. They look like this:

    >->    or    >-->

This is a rewrite operator.
Anything that matches the regex on the left-hand side of the arrowhead
will be replaced with the text on the right-hand side of the
arrowhead. (The "arrowheads" that have one hyphen in them will
rewrite only if the entire left-hand match is found on a single line;
if you use two hyphens, to make a ">-->" instead of ">->", then the
left-hand match can be multi-lined.)

Example: if your name was Agent Smith, your email account
AgentSmith@the.matrix.net, and your mail router was mail.matrix.net at
IP address 192.168.10.5, then the rewrites.mfp file should look like:

    AgentSmith@the.matrix.net>->MyEmailAddress
    [[:space:]]Agent Smith>-> MyEmailName
    mail.matrix.net>->MyLocalMailRouter
    192.168.10.5>->MyLocalMailRouterIP

The idea is to turn your email headers into headers that don't refer
to any of your own actual name, address, etc, but contain only the
strings "MyEmailAddress", "MyEmailName", "MyLocalMailRouter", and
"MyLocalMailRouterIP".

If you have more than one incoming email name, email address, server,
router, etc, add lines in rewrites.mfp for each email name, email
address, server, router, and so forth. This is something you really
_should_ do if you have more than one email path leading to an account
that is being filtered by CRM114 (if you don't, a lot of learning will
have to be repeated for each path, which will cost you accuracy and
use up valuable feature slots in the .css files that you could use in
more valuable ways otherwise). On the other hand, if you have
multiple email addresses that all channel through one CRM114 fileset,
and the addresses receive very different ratios of spam and nonspam
(or very different *types* of spam), then it _might_ be to your
advantage to not use rewrites.mfp (just replace it with an empty
file), so that the extra statistical information of the incoming email
address is not lost.

If all this confuses you to no end, just make rewrites.mfp be an empty
file and everything should work decently well.
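
To make the rewrite idea concrete, here is a small Python sketch that
mimics what the Agent Smith rules above do to a message. It is only a
toy model, NOT the code mailfilter.crm actually runs: the function
names (to_python_regex, parse_rules, apply_rules) and the sample
message are made up for illustration, the POSIX class [[:space:]] is
translated by hand because Python's regex engine is not TRE, and the
one-hyphen versus two-hyphen behavior is approximated by applying
">->" rules one line at a time.

```python
import re

# Toy model of rewrites.mfp rules; names and sample text are hypothetical.
RULES_TEXT = """\
AgentSmith@the.matrix.net>->MyEmailAddress
[[:space:]]Agent Smith>-> MyEmailName
mail.matrix.net>->MyLocalMailRouter
192.168.10.5>->MyLocalMailRouterIP
"""


def to_python_regex(pattern):
    # Assumption: only the POSIX class used in the example needs translating.
    return pattern.replace("[[:space:]]", r"\s")


def parse_rules(text):
    """Split each rule line at its arrowhead into (regex, replacement, multiline)."""
    rules = []
    for line in text.splitlines():
        for arrow, multiline in ((">-->", True), (">->", False)):
            if arrow in line:
                lhs, rhs = line.split(arrow, 1)
                rules.append((re.compile(to_python_regex(lhs)), rhs, multiline))
                break
    return rules


def apply_rules(rules, message):
    """Apply the rules in file order; '>->' rules never match across lines."""
    for regex, replacement, multiline in rules:
        if multiline:
            # Two-hyphen arrowhead: the match may span line boundaries.
            message = regex.sub(lambda m: replacement, message)
        else:
            # One-hyphen arrowhead: rewrite each line on its own.
            message = "\n".join(
                regex.sub(lambda m: replacement, ln) for ln in message.split("\n")
            )
    return message


sample = (
    "To: AgentSmith@the.matrix.net\n"
    "Received: from mail.matrix.net (192.168.10.5)\n"
    "\n"
    "Dear Agent Smith, ...\n"
)
print(apply_rules(parse_rules(RULES_TEXT), sample))
```

After the rules run, the sample no longer contains the user's real
address, name, mail router, or router IP; only the placeholder strings
remain, which is what lets statistics learned from such text be
shared between users.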

-----

Step 4 Part 2 - Setting up the .CSS files

You have a choice here. You can either build your own files from your
own spam and nonspam email, or you can use the pre-learned .css files
available from crm114.sourceforge.net . We recommend that you build
your own files dynamically, as that will result in the best final
accuracy.

In either case your .css files should be in the same directory as your
mailfilter will "run" in (as we mentioned above, default is your home
directory on your mailserver).

The particular directory that the mailfilter "runs" in is variable and
depends on your local setup. Assuming you will use the ".forward"
hook, there are two likely situations. If your mail service runs on
your local machine (say, you have just one machine - and I do hope you
have a firewall in that case), then mailfilter will almost certainly
"run" in your home directory - the directory you're in when you log
in. If your mail service runs on a mail server (not your local
machine), then you will probably have a "home directory" on that
machine as well, and that's the directory that the mail filter will
run in.

If neither of these is the case, you should ask your system
administrator what the correct directory is.

-----

Step 4 Part 2 Method A - Build Your Own Empty .CSS Files

This method will give you the best final accuracy, but you will spend
more time training. This is the recommended method for users wanting
the best accuracy.

To start from scratch, you need to create empty .css files. The
cssutil program will do that for you. Just type:

    cssutil -b -r spam.css
    cssutil -b -r nonspam.css

and you will have created _empty_ spam.css and nonspam.css files in
your current directory (that is, the files are full-size, but contain
no information. They'll be full of binary zeroes).

Once you have these empty files you will have a high (50% or so) error
rate for the first few hours, till you have 'taught' CRM114 what your
particular mix of spam and nonspam looks like.
Proceed below to "Step 4 Part 3 - Checking your installation".

Many people want to "preload" their spam collection into CRM114. This
used to be a bad idea. CRM114 is optimized for TOE learning - "Train
Only Errors" learning; testing something like a quarter of a million
test cases has proven that it is better to train only errors, and
_only_ _as_ _they_ _occur_, than to preload a bulk database into
CRM114.

Note that the previous paragraph says "used to be". The new program
"mailtrainer.crm" can do rapid TOE or DSTTTR training and build your
.css files out of stored spam and good mail collections. You can read
all about mailtrainer.crm in Appendix 1 of this document.

If you're wondering, the statistics from the "torture test" (about
40,000 messages) are that training _only_ errors, in realtime, will
give about 2.1 times better accuracy than force-training a big corpus,
even if the messages are the same messages and presented in the same
order. The "why" is mathematically complicated, but there's an
intuitive description in the FAQ.

Again: you will achieve the best possible accuracy if you let CRM114
itself make errors that you correct in real time.

-----

Step 4 Part 2 Method B - Pre-LEARNed files:

This is the simplest method, but less accurate than method A.

If you choose to use the pre-learned .css files, you need to download
the appropriate crm114 .css.tar.gz file, and then you can just type:

    tar -zxvf crm114-.css.tar.gz

and you'll get the two files "spam.css" and "nonspam.css" in your
current directory. Note that the download is fairly large - between 8
and 50 megabytes - and although this will give you a good starting
point for your own statistics, you will have a better (smaller,
faster) final configuration if you build your own .css files from
scratch.

-----

Step 4 Part 2 Method C - BETA TEST - Using mailtrainer.crm to Build
.CSS Files

New in 20060101 is the "mailtrainer.crm" program.
This program accepts two directories of "archetype" good and spam
email, and runs an iterative training procedure to produce some very
high quality .css files from these examples. The example files need
to be "SMTP Virgin" - that is, exactly what was received at SMTP time
by your mail server, with _nothing_ changed (any changes will affect
accuracy, probably negatively).

The mailtrainer training will typically take something like 1 to 10
minutes per 1000 messages in your training set.

Mailtrainer.crm will create your spam.css and nonspam.css files
automatically. Mailtrainer.crm will also read your mailfilter.cf
configuration file, and rewrites.mfp, so be sure to set up those files
_first_ (if you're doing things in order, you're in good shape).

The full description of how to use mailtrainer.crm is in Appendix 1 at
the end of this document. So, jump there, read Appendix 1, run
mailtrainer.crm, and then proceed to the next section - checking your
.css files.

-----

Step 4 Part 2 Method D - ALPHA TEST -- MAKEFILE Build And Preload .CSS
Files From Fresh Spam and Nonspam

CAUTION - this applies ONLY to kits 20060606 and later!!! DO NOT DO
THIS if you are running a pre-20060606 makefile! It will hose you!

If you, by any chance, happen to have un-altered examples of spam and
nonspam, you can use these to pre-build a set of .css files. (As of
versions 20060606 and later ONLY. Previous versions had a bad
implementation of this that took different arguments and tended to
produce bloated .css files that didn't function well. Post 20060606,
the mailtrainer system is used and that works very well indeed.)

You also need to be sure your emails are "SMTP Virgin" - that is, they
are exactly as received at SMTP time, not with headers or footers
added or taken out by your mail delivery agent or your mail reading
program. (If this isn't true, the headers will be rather bogus, you
will lose significant accuracy, and you should use method A above
instead.)
If you are OK with this, here's what to do:

 1) Put copies (or symlinks/hardlinks) of all of your example spam
    into a subdirectory named spam.dir in the local directory.

 2) Put copies (or symlinks/hardlinks) of all of your example good
    email into a subdirectory named good.dir in the local directory.

 3) IF you want to train from scratch (not necessarily good or bad,
    but your option... choose well):

        rm -rf spam.css
        rm -rf nonspam.css

 4) Invoke the mailtrainer:

        make cssfiles

    to build your new spam.css and nonspam.css files.

That's all. It'll take a few minutes to run but mailtrainer will give
you running status so it's not like things have hung.

Again, let me emphasize that doing this is ONLY recommended on full
installs post 20060606 . Versions prior to that will hose you if you
do this.

--------

Step 4 Part 3 - Checking your installation

Once you have set up mailfilter.cf, rewrites.mfp, the *list.mfp files,
and the .css files, you can test your configuration by typing the
following (the '^D' at the end is a control-D, which is an END-OF-FILE
on Linux. Other systems may use a different END-OF-FILE character):

    ./mailfilter.crm
    This is a test. Just type a few lines of text that you might
    ordinarily get, like a short rant on why Perl is useless for
    big projects, or why Linux is superior or inferior to NetBSD.
    ^D

or (to use mailreaver instead)

    ./mailreaver.crm
    This is a test. Just type a few lines of text that you might
    ordinarily get, like a short rant on why Perl is useless for
    big projects, or why Linux is superior or inferior to NetBSD.
    ^D

If you have set up Mailfilter for Procmail-style filtering you will always get a small report back saying something like either of these (the actual numbers and some minor text strings will change, but you should have something that _vaguely_ looks like the following):

    From foo@bar Thu Sep 18 19:20:35 2003
    X-CRM114-Status: Good ( pR: 12.630237 )

    ** ACCEPT: CRM114 PASS SBPH/BCR TEST**
    Probabilistic match quality: 1.000000, pR: 12.630237
    P(succ): 1.000000e-00, P(fail): 2.342950e-13
    Features: 336, S hits : 4313, F hits : 5901

or:

    From foo@bar Thu Sep 18 19:19:39 2003
    X-CRM114-Status: SPAM ( pR: -2.866484 )

    ** REJECT: CRM114 FAIL SBPH/BCR TEST**
    Probabilistic match quality: 0.001358, pR: -2.866484
    P(succ): 1.358082e-03, P(fail): 9.986419e-01
    Features: 144, S hits : 2337, F hits : 3313

If you are using "mail to spamtrap account" filtering, then you will either get an "accept" report back (the first report above is an "accept") or the text you typed in will be mailed to your spamtrap address. If you don't get a report back, check the spamtrap address and see if your test text ended up there.

If all the numbers are zero, or the result is "UNSURE", that's OK; it just means there isn't enough statistical information in the .css files yet to actually decide if it's spam or not. This is a good situation.

If you don't get _either_ of the above, something is broken, either in your installation of CRM114 or in your configuration file. You need to fix the problem before you engage Mailfilter.

If your installation and configuration pass the above test, congratulations! You have now configured mailfilter.crm.

-----

Step 4 Part 4 - OPTIONAL - CHECKING YOUR .CSS FILES

Whichever method you used to set up your .css files, you can check that the .css files are reasonable. Use the "cssutil" utility.
Note: this works fine for the default classifiers like Markov, OSB, and OSB Unique, but _not_ for Winnow, Hyperspace, or Correlative classifiers; for OSBF classifiers use osbf-util instead of cssutil.

Type in:

    cssutil -b -r spam.css
    cssutil -b -r nonspam.css

You should get back a report something like this:

    Sparse spectra file spam.css statistics:

    Total available buckets          :      1048576
    Total buckets in use             :       506987
    Total hashed datums in file      :      1605968
    Average datums per bucket        :         3.17
    Maximum length of overflow chain :           39
    Average length of overflow chain :         1.84
    Average packing density          :         0.48

Note that the packing density is 0.48 (506987 buckets in use out of 1048576 available); this means that this .css file is about half full of features. Once the packing density gets above about 0.9, you will notice that CRM114 will take longer to process text. The penalty is small at packing densities below about 0.95 and only about a factor of 2 at 0.97.

Note - do NOT believe "ls -la" with respect to .css files! Because CRM114 uses memory mapping instead of file I/O (because it's much faster to go through the page-fault tables than through the file I/O system), the m-time (time last modified) and c-time (time of last inode change) never change, only the a-time (time last accessed). Even the a-time only changes if your file system had the proper compile-time options to keep track of the a-time, and that defaults to "not keep track".

Believe what cssutil tells you - if new features show up after learning (because the bucket counts change), you _are_ learning and "ls -la" is lying to you! Conversely, if the bucket counts do NOT change, you have a file redirection or file protection problem and your system is NOT learning. That's bad and you need to figure out the problem and fix it.

You can also see how easy it will be for CRM114 to differentiate spam from nonspam with your .css files. The utility "cssdiff" will compare the statistical features of two .css files.
(Again, only for Markov, OSB, and OSB Unique classifiers.) Try it:

    cssdiff spam.css nonspam.css

and you'll get back a report like:

    Sparse spectra file spam.css has 1048577 bins total
    Sparse spectra file nonspam.css has 1048577 bins total

    File 1 total features            :      1605968
    File 2 total features            :      1045152

    Similarities between files       :       142039
    Differences between files        :      1279964

    File 1 dominates file 2          :      1463929
    File 2 dominates file 1          :       903113

Note that there's a big difference between the two files; in this case there are about nine times as many differences between the two files as there are similarities (1279964 versus 142039). That's pretty much typical - and it's a good sign that your filtering should be quite accurate.

Now, move on to "Step 5: Engaging Mailfilter".

----------------------------------------------------------------------------
----------------------------------------------------------------------------

Step 5: Engaging Mailfilter

There are two common ways to engage Mailfilter.crm on your incoming mail stream: you can use Procmail recipes and have Mailfilter run as a procmail subprocess, or you can use the .forward hook of Sendmail (and Sendmail clones which also support .forward).

In the first method (recommended), you use Procmail's ability to execute a program as part of a Procmail recipe to run CRM114, which adds headers as needed to let Procmail or your mail-reading program do the sorting.

In the .forward method, you (or your system manager) must place in the directory /etc/smrsh a link to a command that runs crm114. This is because sendmail will NOT run any program that isn't "approved" by the system manager (by linking it into /etc/smrsh/whatever). The output of mailfilter is then directly appended to your /var/spool/mail file (or possibly forwarded to your spam-bucket account).
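If you are not sure which of the two methods fits your machine, a quick look at what is installed can help. The following is only a sketch, not part of CRM114: the function name suggest_method and its two optional arguments are made up for illustration, and treating the presence of an /etc/smrsh directory as a sign that the .forward method is available is an assumption based on the description above.

```shell
#!/bin/sh
# Sketch only: guess which engagement method from this HOWTO looks usable.
# suggest_method and its arguments are hypothetical helpers, not CRM114 tools.

# suggest_method [procmail-binary] [smrsh-directory]
# Prints "procmail" if the binary is found on PATH,
# otherwise "forward" if the smrsh directory exists (an assumed hint that
# sendmail's restricted shell is in use), otherwise "none".
suggest_method() {
    procmail_bin=${1:-procmail}
    smrsh_dir=${2:-/etc/smrsh}
    if command -v "$procmail_bin" >/dev/null 2>&1; then
        echo procmail
    elif [ -d "$smrsh_dir" ]; then
        echo forward
    else
        echo none
    fi
}

echo "Suggested engagement method: $(suggest_method)"
```

A result of "procmail" points at Method A below, "forward" at Method B. A result of "none" only means this sketch found neither; ask your system manager how mail is delivered on your machine.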
-----

Step 5 Method A: For Procmail and Maildrop Users

Procmail users just add a procmail recipe to .procmailrc to run CRM114 and mailfilter whenever your other procmail rules fail to decide what to do.

Here's a sample Procmail recipe set. Notice that we actually have TWO recipes - one to actually run crm114 and mailfilter, the other to then sort the mail based on the result.

    #
    #
    :0fw: .msgid.lock
    | /usr/bin/crm -u /home/my_user_directory mailfilter.crm

    :0:
    * ^X-CRM114-Status: SPAM.*
    mail/crm-spam

That's all that Procmail users should need. Mailfilter should now be active - send yourself a test message and see where it ends up. To use mailreaver instead of mailfilter, just put "mailreaver.crm" in instead of "mailfilter.crm".

If you get the test message, proceed to "Step 6: Training CRM114".

-----

(note: Sub-Method A-one)

If you use an MUA that can highlight on headers, you can use something like this in your procmail (from Philipp Weiss):

in .procmailrc

    CRMSCORE=`$HOME/bin/crmstats.sh`

    :0fw: .formail.crm114.lock
    | formail -I "X-CRM114-Score: $CRMSCORE"

where ~/bin/crmstats.sh is a simple script:

    #!/bin/bash
    grep -a -v "^X-CRM114" | \
        /usr/bin/crm -u $HOME/.crm114 mailfilter.crm --stats_only

------

(note: Sub-Method A-two)

If you're using maildrop ( http://www.courier-mta.org/maildrop.html ), you can put this in your ~/.mailfilter (from Stefan Seyfried and Joost van Baal):

    CRMSCORE=`grep -a -v "^X-CRM114" | crm -u $HOME/.crm114/ /usr/share/crm114/mailfilter.crm --stats_only`
    xfilter "formail -I \"X-CRM114-Score: $CRMSCORE\""
    if ($CRMSCORE < -1)
    {
        xfilter "formail -I \"X-CRM114-Spam: yes\""
    }
    log "Spam: $CRMSCORE"
    if (/^X-CRM114-Spam: yes/)
    {
        to Mail/spam/inbox
    }

----------------------------------------------------------------------------
----------------------------------------------------------------------------

Advanced Topic: Huge Emails and Denial Of Service Avoidance

CRM114 has a number of built-in anti-Denial-of-Service (anti-DoS) features; one of them
is that it will not grow buffers beyond a certain limit, No Matter What. This default maximum is altered with the -w parameter.

However, you may find that you actually receive emails bigger than this limit. In these cases, it is effective to simply filter on the first few tens of kilobytes of incoming text; that will speed things up a lot.

[[ Obsolescence note: CRM114 builds prior to about 20050601 need the method described below. After that, mailfilter has the built-in option :decision_length: in mailfilter.cf which defaults to 16000 chars ]]

This is easy to do with "head". head -c 10000 gives the first 10,000 characters of input, which is usually adequate for CRM114 to get a good decision on. This can be piped in right in the procmail command:

    :0fw: .msgid.lock
    | head -c 10000 | /usr/bin/crm -u /home/my_user_directory mailfilter.crm

    :0:
    * ^X-CRM114-Status: SPAM.*
    mail/crm-spam

-----

Step 5 Method B: The .forward hook file

If you use the .forward hook, be aware that you should NOT put a direct link to crm in /etc/smrsh. Since crm can do arbitrary things (such as SYSCALL to invoke any command; it'd be like putting BASH there), you ought to attempt to control the damage as much as possible.

 1) Add a wrapper in /etc/smrsh that runs crm114's executable binary in /usr/bin, by becoming root and typing:

        cat > /etc/smrsh/crmfilter
        /usr/bin/crm mailfilter.crm >> /var/spool/mail/your_account_name_here
        ^D

 2) Add a .forward file to your account by typing:

        cat > .forward
        |/etc/smrsh/crmfilter
        ^D

That's all. The mailfilter should now be active - send yourself a test message and see where it ends up.

----

Once you have engaged CRM114 mailfilter, you now get to train it to recognize spam and nonspam. Proceed to "Step 6: Training CRM114".

Note: CRM114 contains a design decision that you may have to play with.
Instead of doing memory management games, which both consume significant runtime CPU and present a major denial-of-service opportunity, CRM114 has an upper limit on the window size and it simply won't exceed that limit (it gives an error message if an incoming message tries to exceed the limit).

You -can- change the maximum memory limit at runtime with the -w nnnnn flag; for example, if you want 100 megabytes of memory available, you can set that with

    ... -w 100000000

to set 100,000,000 bytes as the hard limit ceiling on per-buffer memory usage. Actual usage may be about five times that number, as CRM114 does a buffer-shuffling dance to minimize time spent reclaiming and compactifying memory.

---------------------------------------------------------------------------

Step 6: Training CRM114 and Mailfilter

One of the great strengths of CRM114 Mailfilter is that it has no preconceived notions of "spam" and "nonspam". It _learns_ what you consider spam, and what you consider nonspam.

For the first few days CRM114 will make a lot of mistakes sorting spam and nonspam. It is _very_ important that you train each mistake back into CRM114, otherwise it will never learn what you consider spam or nonspam.

You should train in the mistake as quickly as possible. Start one morning and try to train every hour for the first few hours at least. Don't think you're training a computer - pretend you're housebreaking a new puppy.

You train mistakes right from your mail reader. There are several ways to do this. Note that you can use mailfilter.crm _or_ mailreaver.crm interchangeably here; the instructions say "mailfilter.crm" but mailreaver.crm works exactly the same way from the user point of view.

 * Mail-to-Myself with In-Line Commands to retrain (Method A)
 * Shell commands to retrain (Method B)
 * Mutt direct interface (Method C)
 * Some Other Interface (Method D)

Whatever Way You Train: try to train _approximately_ equal amounts of spam and nonspam.
If you are within 50% one way or the other, performance will be very good.

If you are running mailfilter.crm: Train only errors! This is called TOE training. (TOE :== Train Only Errors) It's not necessary to train near-misses; experiments show that the performance increase on training near misses is minuscule at best, and may be negative at times.

If you are running mailreaver.crm: Some messages may come through with a header that says "I am unsure about this message. Please train it either way." - so do exactly that. This is one reason mailreaver learns faster than mailfilter, and why it's also more accurate.

It's best for at least the first day or so that you check your mail at least every hour or so and send training information back to CRM114. This will help it rapidly converge on a good set of statistics for your particular mix of spam and nonspam.

It will take several days' worth of errors for CRM114's mailfilter to approach 95% accuracy, and around two weeks to a month to reach 99+ percent accuracy. I usually exceed 99.9% accuracy (less than one error per thousand).

Step 6 Method A: Mail-to-Myself

The first way is to use the in-line command feature. Just forward the mistake back to yourself, with full headers (except edit out any CRM114-added headers or text). Just before the first line of the text to be "learned" as spam or nonspam, insert a COMMAND line. Everything from the command line to the end of the message will be learned (so edit the text to remove things you _don't_ want considered indicative of spam/nonspam nature).

The command line looks like this:

    command <password> spam

or

    command <password> nonspam     (for mailfilter.crm)
    command <password> good        (for mailreaver.crm)

The "c" in "command" must be in column 1, and you must put your secret password into the command line. Don't use the <> brackets, use JUST your secret password.
Examples: If your secret password was "Ihatespam", then the command line to learn something as spam would be:

    command Ihatespam spam

and the command to learn something as nonspam would be:

    command Ihatespam nonspam      (for mailfilter.crm users)

or

    command Ihatespam good         (for mailreaver.crm users)

[[ Mailreaver users: if you have the cache enabled (which is the default) and the message you mail to yourself contains an intact SFID (Spam Filter ID), either in the Message-Id: field or in the X-CRM114-CacheID: field, then you don't need to worry about editing the text so that extra headers, footers, etc. are removed. The cached version of the message is saved the first time the message is seen by mailreaver, and so headers, footers, etc. that are added by your MDA or MUA or other stuff will NOT affect accuracy. ]]

If you are a mailreaver user, you also have a priority system you can access, either by editing your priolist.mfp file directly or by sending yourself email in the following forms (where mypwd is the command password and a_regex_pattern is what will be used for priority matching. Priority matches can occur in both the headers and body of the text.)

    command mypwd maxprio +a_regex_pattern   - sets a maximum priority GOOD
    command mypwd maxprio -a_regex_pattern   - sets a maximum priority SPAM
    command mypwd minprio +a_regex_pattern   - sets a minimum priority GOOD
    command mypwd minprio -a_regex_pattern   - sets a minimum priority SPAM
    command mypwd delprio a_regex_pattern    - deletes the first priority list entry that fully matches the regex pattern

Step 6 Method B: Shell commands to retrain

>> For mailfilter users (mailreaver is different - skip to below!) <<

The second way to train in spam and nonspam is to use mailfilter.crm's shell command line options.
When you find a spam that was mistakenly accepted as good mail, pipe it through mailfilter.crm with the "--learnspam" flag set, like this:

    bash> mailfilter.crm --learnspam < the_spam.txt

Likewise, if you get an email that was falsely classified as a spam, pipe it through mailfilter with the "--learnnonspam" flag set, like this:

    bash> mailfilter.crm --learnnonspam < the_NON_spam.txt

(Yes, if you have a scriptable mail reader, you can put these functions right on the menu bars somewhere. Yes, that's a hint. :) )

[[ If you are using mailreaver.crm instead of mailfilter.crm, and caching is enabled, you don't even need to pipe in the full text; all that's needed is either the intact X-CRM114-CacheID: line or the Message-ID line containing an intact sfid. That's another reason to switch to mailreaver! :) ]]

>> For mailreaver.crm users <<

You're in luck, assuming you have taken the default and left caching turned on. All you need to pipe into mailreaver for training is any text or text fragment containing an intact X-CRM114-CacheID: line or the Message-ID line containing an intact sfid; mailreaver will go get the exact incoming text of the message and train it, so you don't need to worry about munged headers.

The command looks like this:

    crm mailreaver.crm [options] < some_text.txt

The command options you have available on the mailreaver command line are:

    --spam        - train the incoming text as SPAM (if there's a recognizable cacheid, use the cached msg).

    --good        - train the incoming text as GOOD (if there's a recognizable cacheid, use the cached msg).

    --cache       - default is to train using the text stored in the reavercache. Use --cache=NO to *not* use the cached version, if for some reason you don't want to.

    --dontstore   - default is that every incoming message that isn't a training message (that is, --spam or --good) is put into the cache.
                    Use --dontstore to not put the message into the cache (for example, for "seekrit" users who aren't allowed to train or who might get msgs that you don't want archived).

    --stats_only  - Don't do a full report or forwarding, just report the pR value on stdout. This is a value between (roughly) -1000 and +1000 where negative values indicate spamminess and positive values indicate goodness. For a simple test, just look at the first nonblank character. If it's a "-" sign, the input was spam. Because there's no other output, --stats_only forces --dontstore.

    --outbound    - This message is "outbound" - that is, known to be good. If it would classify as spam, train and cache it. Otherwise, no action.

    --undo        - To the extent possible, undo a training with this text (cached will be used if possible). --undo requires either --spam or --good as well.

    --fileprefix=dir
                  - Assume that the config file "mailfilter.cf" and the .css files are in directory "dir". Remember to use a final closing slash on the directory name, e.g. /my/home/dir1/ instead of /my/home/dir1. Otherwise, the filename will be spliced together from the last component of your --fileprefix and the nominal names, and you almost certainly don't want that.

    --config=file - Don't use mailfilter.cf as the configuration file; instead use the file so noted.

Step 6 Method C: For Mutt Users

(Contributed by Mathieu Doidy and Joost van Baal:)

In your ~/.muttrc, put:

    macro index \es "<pipe-message>crmlearn --learnspam\n<save-message>=spam/done\n" \
        "crm114 learn as spam, save in spam/done"
    macro index \eh "<pipe-message>crmlearn --learnnonspam\n" "crm114 learn as ham"

where crmlearn is this script:

    grep -a -v "^X-CRM114" | \
        /usr/share/crm114/mailfilter.crm -u $HOME/.crm114/ $1 | \
        grep -a "^X-CRM114"

Now you have two new macros in the Mutt index menu:

 * esc-s will tag a message, falsely classified as ham, as spam,
 * esc-h will tag a message, falsely classified as spam, as ham.

Step 6 Method D: Some Other Method

There are at least five other ways to retrain CRM114.
Some interface with common mail readers, some are command-line tricks. Rather than catalog them here (which would quickly go out of date), you should go to the CRM114 web page (crm114.sourceforge.net) and browse the list of applications under "Cool Stuff". Some of these are plugins, some are web-based MUAs, and some are entirely new mail filters.

What To Do if CRM114 says "LEARNING UNNECESSARY..."
---------------------------------------------------

Occasionally, some CRM114 configurations may refuse to learn an error, claiming that it "got it right the first time" (yes, this is a subtle bug that is not allowing itself to be found, but there is reason to believe it has to do with the interaction of mail clients and headers and that some mail readers are lying to the user when they claim they are forwarding with full headers).

While we applaud this self-confidence, the error is still there, so you need to "force" the learning. You can do this either from BASH or from the mail-to-yourself command line.

From BASH, add --force to the command line:

    # mailfilter.crm < the_error_text --learnspam --force

For mail-to-yourself, add "force" to the command line:

    command mysecretpassword spam force

(and similarly for nonspam).

The training files "spamtext.txt" and "nonspamtext.txt"
------------------------------------------------------

[[ Note: this section is becoming obsoleted by the reavercache, which does more, better, and easier. ]]

Whenever CRM114 learns a new spam or nonspam, it not only modifies the .css files, but it also keeps the source text of that learning in the files "spamtext.txt" or "nonspamtext.txt".

These two files can be considered the "source code" of your .css files; they're all you really need to rebuild your .css files if/when you upgrade CRM114 and the .css file format is changed but the algorithm is similar.
For example, upgrading from Markovian filtering (the default) to Winnow or OSBF is "incompatible", and you might want to start with these files as a kickstart.

... but not necessarily; some filtering is radically different from Markovian. As we add new filters as technology moves forward, sometimes we will be able to kickstart, and sometimes we can't.

 - for upgrades that can use the current .css files, we will say so;

 - for upgrades that cannot use the current .css files, but *can* get kickstarted from spamtext.txt and nonspamtext.txt, we will say so;

 - for upgrades that are radically different enough that you must relearn from scratch, we will say so (and have you rename your old spamtext and nonspamtext files so that they will not be accidentally reused).

If your mail system is so short of disk that you cannot afford to keep these (relatively) small files, then you may either delete them or symlink these files to /dev/null; you don't absolutely *need* them. These files are quite small though - I have been running CRM114 for nearly five years now and my *total* example text sizes are 678 Kbytes for nonspam and 893 Kbytes for spam (after daily use and about a gigabyte of email).

-----------------------------------------------------------------------

Step 7: Adding Priority Lists, Whitelists, and Blacklists

If you really want, you can add white, black, and priority lists to CRM114. Most people don't need them, but there are always exceptions.

[[ Note to mailreaver.crm users - mailreaver.crm uses ONLY the priolist.mfp, and does NOT support whitelist.mfp or blacklist.mfp. This really is no loss of functionality, because anything you can do with a whitelist or blacklist, you can also do with a priolist, and more besides. ]]

For example, your lawyer, your boss, and your paramour all probably rate being on your "whitelist", so whatever they send to you is always marked "nonspam".
Likewise, your ex-girlfriend/boyfriend, your nagging acquaintance, and the stalker from the library should all get blacklisted.

Whitelisting, blacklisting, and prio-listing are all based on regex matching. If the regex you put in the file "whitelist.mfp" matches the incoming mail _anywhere_, the mail will be marked "good" no matter how it scores statistically.

Similarly, if the mail matches any regex in "blacklist.mfp", the mail will be marked as "spam", no matter how it compares statistically.

Note that sometimes this can cause considerable confusion; for example, "ac.com" in a whitelist will not just match "billing.ac.com", but also "drac.complete.viagra.sales.com" (the match being the 'ac.com' in "drac.complete"). To prevent this, use ^ and $ to "anchor" the start and end of the regex, if possible.

Lastly (well, actually firstly, because prio-listing happens before whitelisting or blacklisting), any mail that matches any regex in priolist.mfp is whitelisted or blacklisted according to the matching entry. The format of priolist.mfp is that the first character on the line is a + or a -, which indicates "whitelist" or "blacklist", and the rest of the line is a regex. These regexes are tested in the order given in the file. An empty file is perfectly acceptable.

For examples of how to set up the whitelist, blacklist, and priolist files, see the included "whitelist.mfp.example", "blacklist.mfp.example", and "priolist.mfp.example".

Note: for my accuracy tests, I *turn off* whitelists, blacklists, and prio-lists.

Be sure to test any whitelist, blacklist, or other list that you add, otherwise you may get a rude surprise some day.

----------------------------------------------------------------

Step 8: Useful Utilities

You don't _need_ to know the stuff in this section to set up and use CRM114 and mailfilter or mailreaver, but it might be useful to you - or at least satisfy some of your curiosity.

There are three utilities for dealing with the .css files (these are the files that contain the "learned information").
The utilities are:

    cssutil   - gives you a readout of the characteristics of the information in a .css file

    cssdiff   - gives you a summary of the differences between two .css files (handy for seeing learning!)

    cssmerge  - merges two .css files into one; handy for importing new data into a .css file. Note that this is a destructive operation on the first .css file named!

The cssutil utility
-------------------

Usage is

    cssutil somefile.css

which will give you statistics on the file somefile.css. You can then rescale, clip, and otherwise manage your .css files. It is especially useful to check the "Average Packing Density" of the .css files you use; when it approaches .7 to .8, you may want to consider enlarging your .css file. To do that, see below on "Enlarging a .css file".

Here's the -h help:

    Usage: cssutil [-b -r] [-s css-size] cssfile
        -h           - print this help
        -b           - brief; print only summary
        -r           - report then exit (no menu)
        -s css-size  - if no cssfile found, create new cssfile with this many buckets.
        -S css-size  - same as -s, but round up to next 2^n + 1 boundary.

The cssdiff utility
-------------------

To get the difference between two .css files, use

    ./cssdiff somefile.css anotherfile.css

which writes out a summary of how different the two .css files are.

The cssmerge utility
--------------------

To merge two .css files, use cssmerge:

    ./cssmerge outfile.css infile.css

Note that this is _destructive_ to outfile.css, so make a copy somewhere else first. You _CAN_ merge two .css files of different length. You can also expand (or contract) a .css file this way: rename the old file, and allow a new one to be created with learnspam or learnnonspam while using the '-s nnnnnnnnn' s(lots) flag to set the number of feature slots desired in the new file. Then cssmerge your old file into the fresh new file, and all is well.

Here's the cssmerge help:

    Usage: cssmerge <out-cssfile> <in-cssfile> [-v] [-s]
        <out-cssfile> will be created if it doesn't exist.
        <in-cssfile> must already exist.
        -v           - verbose reporting
        -s NNNN      - new file length, if needed

Enlarging a .css file
---------------------

One of the advantages of CRM114 is that the .css files are relatively small and of fixed size; they don't grow out of control and never need trimming if you use <microgroom>, which is the default.

The disadvantage of this is that if your spam/nonspam discrimination is too convoluted, CRM114 won't be able to sort the two apart (in trek-speak this is a high-order nonlinearity in the discrimination function). The fix in this situation is to increase the dimensionality of the feature space.

The number of dimensions is about 1/12 the number of bytes in the .css files; this works well at about a million dimensions (12 megabytes) for most people. But if you're not most people, you may need to (eventually) increase it.

You can tell when this is necessary - running cssutil will give you a utilization and percentage of slots full; when that gets up near 95 percent, you may be running low on space and old features will be erased to make room for new features (that is, your feature set will dynamically evolve in real time to find what works). However, that's slow and may cause a slight loss of accuracy.

One way to fix this is to "increase the dimensionality of the discrimination hyperspace" (no, I am not making that phrase up). It means to add new slots to the .css files. The easiest way to do this is to

 1) use cssutil to create a temporary, empty, larger .css file
 2) merge the data from the old, small .css file onto the new big file.
 3) copy the new big file over the old, small file.

You can even combine steps 1 and 2, because newer versions of cssmerge will create a new file if needed (the -s N flag sets the number of slots in the new file; -S N does the same thing but rounds up to a 2^N+1 boundary, which is recommended).

For example, here's how to increase the size of the spam.css file from 1,000,001 slots (the default) to 2,000,001 slots.
Just type:

    cssmerge temporary.css spam.css -s 2000001
    mv temporary.css spam.css

The newly replaced spam.css will have all of the features of the old spam.css file, but will be 2000001 slots long instead of the default 1000001 slots.

--------------------------------------------------------------------
--------------------------------------------------------------------

APPENDIX 1

Using mailtrainer.crm

New (as of 20060117) is the training program mailtrainer.crm. This program will take directories of spam and nonspam files, and iterate over them to build (or improve) a set of .css files for you.

***** WARNING WARNING WARNING *****

Mailtrainer.crm (and the documentation for it) is BETA QUALITY. There are very likely some very amusing bugs. Be warned !!! Archive your data and your .css files before using mailtrainer.crm. Really!

***** WARNING WARNING WARNING *****

Mailtrainer by default uses whatever settings are in your current mailfilter.cf file, so you'll get .css files that are optimized for your standard setup including mime decoding, normalization, classifier flags, etc. However, this means you *must* set up your mailfilter.cf file FIRST, before you run mailtrainer.

Mailtrainer.crm uses DSTTTTR (Double Sided Thick Threshold Training with Testing Refutation) which is something I didn't come up with (Fidelis is on my list of suspects for this). The good news is that this can more than double accuracy of OSB and similar classifiers.

It is safe to run mailtrainer.crm repeatedly on a .css fileset and training data; if the data doesn't need to be trained in, it won't be (unlike the old "make cssfiles" command, which forces everything in whether it is useful or not). This is a big improvement and minimizes .css file bloating. "make cssfiles" has now been fixed to use mailtrainer.crm.

The example files in each of the spam and good directories need to be one example per file.
The closer these files are to what mailfilter.crm will see in real life, the better your training will be. Preferably the headers and text will be complete, intact, and unmutilated. The closer these examples are to what SMTP will show "on the wire", the better.

If you use a mail reader that puts your "good" and "spam" emails as separate files in two different directories (or can hack up a script to do that) then you could even run mailtrainer.crm automatically every night to optimize the .css files to your current profile.

If you do this, your script needs to guard against situations where you haven't checked your mail in a few days and errors crept in; for safety your script should add files to the training directories only after you have hand-checked them (or at least tacitly agreed).

If you find you've made a mistake, don't worry. It's recoverable. Just put the misplaced files into the correct directory and rerun mailtrainer.crm. That will re-optimize the .css files (though some low-value features may be swept away). Alternatively, if you start out keeping each and every file that you've trained, you can just delete the erroneous spam.css and nonspam.css files and re-run mailtrainer.crm to get correct .css files.

It's OK to have the spam and good directories just be full of links (either symlinks or hardlinks) to the actual spam and good mail files (that's what I do).

NOTE: mailfilter.crm doesn't (yet) understand how to build and maintain the spam and good email directories.

NOTE 2: It is at this point unknown whether it's a good idea or a bad idea to run mailtrainer on the probably good and probably bad emails (which end up in the reaver cache as .../prob_good/whatever and .../prob_spam/whatever), or just on those that are in the thick threshold zone. If anyone gets good data on this, let me know please.

----- Mailtrainer Options ---

The mailtrainer.crm options are as follows. You *must* provide --spam and --good; the other flags are optional.
Required:

  --spam=/directory/full/of/spam/files/one/per/file
  --good=/directory/full/of/good/files/one/per/file

        These define the directories or files to be learned.  If
        these end with a slash, it means a directory and all of the
        files within are used; otherwise, it's taken as a file.  If
        the filename contains a wildcard, be sure to enclose it in
        single quotes 'like.this' or else BASH will do bad things to
        it.  Note that a wildcard is (currently) incompatible with
        the --random shuffling of training order.

Optional:

  --help - quick synopsis of mailtrainer options.

  --thick=N - thickness for thick-threshold training; this overrides
        the thickness in your mailfilter.cf file.  Omit it if you
        want to use the in-file value.  (10 works well for most
        classifiers; use 0.1 or less for Hyperspace)

  --streak=N - how many successive correct classifications before we
        conclude we're done.  Default is 10000.  This number should
        be larger than the total number of sample emails.

  --repeat=N - how many passes should we go through this corpus
        before we conclude we're done.  Default is 1.

  --worst=N - run the entire training set, then train only the N
        worst offenders, and repeat.  This is excruciatingly slow but
        produces very compact .css files.  Use this only if your host
        machine is dreadfully short of memory.  Default is NOT to use
        worst-offender training.  N=5% of your total corpus works
        pretty well, but N=1 will produce the most compact .css
        files.

  --random - randomize the training set rather than taking the files
        in sequential alternating order (one from good, then one from
        spam).  Note that this is (currently) incompatible with a
        wildcard for selection of good versus spam files.

  --reload - if we run out of one kind of file (good or spam) before
        the other, "reload" (start from the first file again) in that
        category.  Default is to simply use only the remaining
        category for the remainder of the training pass.

  --verbose - Verbose.  Print out more stuff.
  --fileprefix=directory - use the mailfilter.cf, rewrites.mfp, and
        .css files in 'directory', rather than in the current
        directory.

  --goodcss=somecssfile.css - use this 'good' cssfile instead of the
        default "nonspam.css"

  --spamcss=somecssfile.css - use this 'spam' cssfile instead of the
        default "spam.css"

  --collapse - collapse the flying output down to scroll less on a
        TTY.

  --report_header="some text" - put this at the head of the report

  --rfile="somefilename.txt" - append (not overwrite!) log to this
        file.

  --validate=regex_no_slashes - Any file with a name that matches the
        regex supplied is not used for training; instead it's held
        back and used for validation of the new .css files.  The
        result will give you an idea of how well your .css files will
        work.  Do NOT put slashes around the regex!

Example 1:

  - We want to create new .css files for our mail filter
  - We already have presorted directories of good and spam email
  - We have already set up mailfilter.cf and rewrites.mfp to define
    our preferred configuration.

Then we can use the following incantation to build some nice .css
files (not perfect, but not bad).  This incantation can all be on one
line (remove the '\' backslash characters if you put it on one line),
and don't forget the trailing slash for directory names; otherwise
mailtrainer will try to train the directory listing itself (and fail,
because a directory can't be read like a normal file).

Note that you *must* set up your mailfilter.cf and rewrites.mfp files
first, before doing this, otherwise you'll generate bad .css files, or
possibly get an error!

    crm mailtrainer.crm \
        --good=/your/good/files_dir/ \
        --spam=/your/spam/files_dir/ \
        --repeat=5 \
        --random

Example 2:

  - We want to run mailtrainer.crm against a bunch of examples in the
    directories ../SA2/spam/ and ../SA2/good/.
    (This happens to be where the TREC test set is on my computer;
    your location will be different.)

  - We want to quit when we get 4000 tests in a row correct, or if we
    go through the entire corpus 5 times.

  - We want to use DSTTTTR, with a training thickness of 5 pR units.

  - We want to "validate" our training - that is, to hold back some
    fraction of the training set as test cases.  In our case here, we
    decide we want to use any file whose name contains "3." (what a
    BASH glob would write as "*3.*").  These files will be saved up
    and used as a test corpus instead of for training.

Here's the command (this can all be on one line as well; if so,
remove the backslashes):

    crm mailtrainer.crm \
        --spam=../SA2/spam/ \
        --good=../SA2/good/ \
        --repeat=5 \
        --streak=4000 \
        --validate=[3][.] \
        --thick=5.0

This will take about eight minutes to run on the TREC 2005 SA corpus
of about 6000 messages; 1000 messages a minute is a good estimate for
5 passes of DSTTTTR training.

Notes:

  * If the .css statistics files don't exist, they will be created
    for you, in the format set up by the mailfilter.cf file.  So be
    SURE to set up mailfilter.cf first!

  * If the first test file comes back with a pR of 0.0000 exactly, it
    is assumed that these are empty .css statistics files, and that
    ONE file will be trained in to each .css file that returns a
    0.0000, simply to get the system "off center" enough that normal
    training can occur.  If there is anything already in the files,
    this won't happen.

  * When running N-fold validation, if the filenames are named as in
    the SA2 corpus in a form of 00123.456789 , there's an easy trick
    to partition the data into 10 roughly equal sets.  Just use a
    validation regex like [0][.] for the first run, [1][.] for the
    second run, [2][.] for the third, and so on.  Notice that this is
    a CRM114-style regex, and _not_ a BASH-style file glob as "*3.*"
    would be.  If you use a glob pattern like "*3.*" , then BASH will
    suck it in and expand it in-line to all of the individual
    filenames and that won't work.
    A regex like [chars] will normally pass through BASH unscathed
    (put it in single quotes if you want to be certain).

  * If you want to run N-fold validation, you must remember to delete
    and rebuild a fresh set of .css files after each run, otherwise
    you will not get valid results.

  * N-fold validation does NOT run training at all on the validation
    set, so if you decide you like the results, you can do still
    better by running mailtrainer.crm once again, but DO NOT specify
    --validate.  That will train in the validation test set as well,
    and hopefully improve your accuracy still more.

---------------------------------------------------------------------

That's all!  If you have errors or updates (or find bugs!) please let
me know; the best way is to join the CRM114-general mailing list; it's
on the webpage:

    http://crm114.sourceforge.net

and ask there.  The reason for using the mailing list rather than
personal email is that personal email isn't archived, but the mailing
list _is_ both archived and read widely, so we not only create a
background archive of solutions but you will also get a better answer
back faster than if you sent the email to me alone.

Enjoy, and good luck.

    -Bill Yerazunis