Last updated: 08 July 2005

OVERVIEW:

This software implements the fast, stable logistic regression
algorithms of Paul Komarek and Andrew Moore, using truncated
iteratively re-weighted least squares for parameter estimation.  You
can read about the algorithms at http://komarix.org/ac/papers.

This software and all documentation are licensed under the GNU General
Public License, version 2.  See the LICENSE file for more details.
The authors retain full copyright privileges.


CURRENT STATE:

The software still contains a lot of crud from the original research
software.  We are now below 17 kloc, having started at 170 kloc.
There is still a lot of room to reduce the size and complexity of this
formerly-research software.  For instance, we do not need the dynamic
capabilities of our vector types.  We endeavor to make this code ANSI
C99 compliant.  In fact, most of our code follows the ANSI C89
standard, and use of C99 will probably be restricted to the improved
and safer string library functions.


BUILDING THE SOFTWARE:

Our software may require GNU make in order to build properly.  If you
have strange problems with our makefiles, please be sure to try
building with GNU make.  Edit Makefile.conf if you feel the need, or
if the software doesn't compile for you.  You can enable zlib support
from Makefile.conf, which allows loading of compressed dense numerical
data files and saving to compressed files.  See the documentation in
doc/ for more information.

There are two executables to build: train and predict.  Simply run
"make" in this (top-most) directory to build them.  This will build a
gently optimized (-O2) version of the software.  We plan to add a
k-fold cross-validation executable as soon as we finish the clean-up
work.  Of course, you can use the train and predict programs with a
wrapper-script to achieve the same thing, but this isn't very
convenient.
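
Until that executable exists, a wrapper script along these lines can do
the job.  This is only a sketch: the file names, the synthetic demo
data, and the round-robin fold split are our own illustration, and the
"in", "save", "load", and "pout" arguments simply follow the SMALL
EXAMPLE later in this README.

```shell
#!/bin/sh
# Sketch of k-fold cross-validation with train and predict.  The demo
# data below is synthetic; substitute your own file (e.g. a-or-d.csv).
set -e

K=5

# Tiny synthetic csv (binary output in the last column) so the sketch
# is runnable end-to-end without any particular dataset.
for n in 1 2 3 4 5 6 7 8 9 10; do echo "$n,$((n % 2))"; done > demo.csv

# Deal the rows round-robin into K fold files: fold.0 ... fold.4.
awk -v k="$K" '{ print > ("fold." NR % k) }' demo.csv

i=0
while [ "$i" -lt "$K" ]; do
  # Training set for this round: every fold except fold.$i.
  : > trainset.csv
  j=0
  while [ "$j" -lt "$K" ]; do
    [ "$j" -ne "$i" ] && cat "fold.$j" >> trainset.csv
    j=$((j + 1))
  done

  # Train on K-1 folds and predict on the held-out fold.  Guarded so
  # the sketch still runs before the executables have been built.
  if [ -x train/train ] && [ -x predict/predict ]; then
    train/train in trainset.csv save "params.$i"
    predict/predict in "fold.$i" load "params.$i" pout "predictions.$i"
  fi
  i=$((i + 1))
done
```

Collecting and scoring the per-fold prediction files is left out, since
the scoring you want depends on your application.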

Run "make test" to run a small test of the software, and look for
"Success: all" near the bottom of the output.

Run "make doc" to build the documentation.  At minimum this requires
latex (version 2e), which creates the dvi output.  dvips is
used to create the PostScript output, and ps2pdf handles conversion to
pdf format.  latex2html allows creation of html documentation, and
html2text is used to create plaintext documentation.  Hopefully I have
remembered to prebuild the documentation for you.

You can edit Makefile.conf if you want to change the compile flags or
link flags.  To build a debuggable version of the software, use the
"t=debug" flag.  For example, "make t=debug".  To build for profiling,
use "t=profile". "t" stands for "type".  Note that you will have to
remove old object (.o) files if you want the entire application
to use a new compile type.  See "make cleanall" below.

You can cd to most subdirectories and run "make" to build the library
or executable for that directory.  For instance, "cd train && make".
Running "make clean" does some localized cleaning of files.  Running
"make cleanall" in this (top-level) directory performs the localized
cleaning and also cleans all subdirectories.  Running "make cleanall"
in a program or library subdirectory cleans that directory and the
directories for all of its dependencies.


USAGE:

Short usage instructions for the train and predict programs will be
printed if they are run with no arguments.

Full instructions for use can be found in the doc directory.  Note
that you may have to run "make doc" (or just "make" in the doc
directory) to generate the .ps, .pdf, .txt, and .html documentation.

You will find two simple datasets in this (top-level) directory.  They
contain the same data, but in two different formats.  a-or-d.csv is
in a comma-separated-value format suitable for dense non-binary data.
a-or-d.txt is in a sparse "spardat" format suitable for sparse binary
data.  In both cases, the outputs must be binary.  For csv files, the
output is the last column; for spardats, it is the first column.  For
more information about data formats and naming conventions, see
the full documentation.  If anything about our data conventions seems
strange, it is probably for historical reasons.
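
As a hypothetical illustration of the two conventions (these rows are
made up, not taken from the bundled datasets, and the sparse lines
assume a label-then-index layout; see the full documentation for the
authoritative syntax):

```
# Dense csv: one record per line, binary output in the LAST column.
0.8,2.5,0.0,1
0.5,1.0,3.1,0

# Sparse "spardat": binary output FIRST, then the indices of the
# record's nonzero binary attributes.
1 3 17 42
0 5 17
```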


SMALL EXAMPLE:

make
cd train
./train in ../a-or-d.csv save params.txt
cd ..
cd predict
./predict in ../a-or-d.csv load ../train/params.txt pout predictions

Note: We realize that you generally do not want to predict values on
exactly the same data from which you estimated the model parameters.
The example above only shows how to run the programs, not how to mine
data correctly.

-Paul Komarek
