Jul 22 2009

MLSharp Available

Category: MachineLearning | mlsharpMatt @ 00:46

Last night, I finished moving ML# (mlsharp or ML-Sharp) to Google Code.  You can check it out here.  This project was originally just a library I used to run experiments for a grant I was working on.  Since then, it has matured a bit and is now (I feel) on the verge of becoming a useful tool for machine learning.  It is currently limited to only a few classifiers, and it depends on Weka for some of its core functionality, but I plan to address these deficiencies as time allows. 

In future posts, I’ll be describing some of the features of ML# and how it all works.  For now, ML# is out there for the world to throw stones at.  At this stage, I would love to have some feedback on things like high-level organization, design, coding style, etc. so that I can get those things locked down before moving forward.  Feel free to check out the code.  If you are interested in contributing to the project in a larger way, send me an E-mail. 

My work week has gone to crap, so no liteGrid update today.

Tags:

Apr 10 2009

A (much) more informed view of Infer.NET

Category: MachineLearningMatt @ 08:44

In response to my naive impression of Infer.NET, John Winn has supplied a wealth of insight into its design and implementation:

Many thanks for taking the time to look at Infer.NET and post your comments. As one of the researchers behind Infer.NET, I would like to explain a bit more about the project which may make some of the design choices more understandable. Software packages like Weka are great for loading, visualizing and applying existing machine learning algorithms to data. In these packages, the machine learning algorithms are 'black boxes' - you supply them with data in the form they expect and they perform clustering/regression/classification or whatever. In contrast, Infer.NET is a tool for making new algorithms, customized to solve particular problems that don't fit neatly into one of the existing forms. For example, a recent problem we looked at involves ranking a number of papers submitted to a conference where each paper has a certain number of reviewers and reviewers have biases - so you need to learn both the paper's quality and the reviewer's bias simultaneously. This problem does not fit into one of the above categories, but can be represented as a Bayesian model and solved in Infer.NET. You can also solve standard clustering, regression and classification problems by constructing the appropriate model - for example, clustering is achieved by creating a mixture using a Switch expression, such as in this example. But because Infer.NET opens up the black box you can then take your model and tailor it to your problem (e.g. these two classification results should be similar, some of my data is labeled and I'd like the clustering to reflect that, I know that this output must lie between these other two outputs). This ability to create an algorithm exactly matched to the problem you are trying to solve turns out to be very valuable in many application domains, since you can include rich domain knowledge in the model. We also have many examples of problems which simply cannot be solved without this ability.

To handle a large range of models efficiently, the Infer.NET engine is structured as a compiler - it takes a model definition and compiles it into C# code for solving the model. This allows Infer.NET to be efficient enough to apply to huge datasets (e.g. billions of datapoints). Infer.NET can execute the generated code for you or you can take the code and embed it in a .NET application of your own. The magic string name you mention is the name of the variable in the generated code - it is entirely optional - if you don't supply it then the system will generate a name for you, but the code will be less readable. The Variable.If construct does require you to define your model in a single thread - however the generated code *is* fully thread safe and can be executed either multi-threaded or fully distributed on a cluster. One alternative to the admittedly ugly Variable.If construct would be to have a separate text file describing the model (like in BUGS) and you could load the model from the file. However, this would prevent dynamic construction of the model in code and would require users learning a new modeling language, whereas the slightly less elegant API version allows dynamic construction and can be called from any .NET language. Another alternative we considered is using LINQ expressions to specify the model, but these do not support statements (such as if,switch) at the moment. These are planned for future versions of LINQ, at which point we will look again at this option.

In the project so far, we have concentrated our effort on the core inference engine - making it so that it can be applied to a large range of models, making inference fast and validating the engine on a wide range of problems in different domains. We realize that, in its current form, the compiler is less approachable than GUI-based inference software, and this is something we are looking to address in future. However, if you have a problem which is not in a form that is solvable by a standard black-box machine learning algorithm then Infer.NET has a lot to offer.

Best wishes,

John Winn
Microsoft Research Cambridge

So, I think I missed the point of what Infer.NET provides: it’s not a framework of machine learning techniques (like Weka), it is a framework for building machine learning algorithms.  A lot of the syntax nastiness makes a lot more sense now, too.  It’s probably still beyond the reach of many developers, but those experienced in probabilities and machine learning should be able to build powerful models using Infer.NET.  Thanks for the response, John!

Tags:

Mar 27 2009

Naive impressions of Infer.NET

Category: MachineLearningMatt @ 08:51

I finally found time to look through Infer.NET.  For those not in the know and too lazy to click on links, Infer.NET is a new projecting coming out of Microsoft Research.  According to them, it’s a framework for machine learning.  They claim that provides a lot of state-of-the-art stuff, including “message passing.”  I’m not sure why “message passing” is an important feature in a package that claims to be for machine learning, but that’s what the website says, so…

Despite what the website says, I don’t really see this as being a framework for machine learning.  I see it as an API for probability, which is indeed a very important part of machine learning, but it certainly is not a framework for ML in the same as Weka is.  Don’t download this package expecting to find a lot of variety in what you can do.  All you really get here is a single type of classification technique: Bayesian models.  I did not see anything that resembled clustering, and none of their examples dealt with regression/prediction of real-valued functions, so I’m guessing that’s not supported either. 

Ignoring the fact that this is definitely not an ML framework, it is a neat idea.  Bayesian methods are very powerful, and this could be a good framework for using them to solve large problems in .NET.  That said, I see a lot of flaws with it.  It’s a beta release, so I’m going to be gentle, but I want to get these out in the open on the off chance that someone at Microsoft is listening.

First, the API is really not very clean.  Let’s look at an example:

   1: Variable<double> x = Variable.GaussianFromMeanAndVariance(0, 1).Named("x");
   2: Variable.ConstrainTrue(x > 0.5);

Why in the world should I have to assign a magic-string name to my variable?  I’m going to assume that this will be fixed before the next release, because that’s just ugly, and I can see no real reason for it.

Next, we have this beauty:

   1: Variable<double> probIfTreated, probIfControl;
   2: using (Variable.If(isEffective))
   3: {
   4: // Model if treatment is effective
   5: probIfControl = Variable.Beta(1, 1);
   6: controlGroup[i] = Variable.Bernoulli(probIfControl).ForEach(i);
   7: probIfTreated = Variable.Beta(1, 1);
   8: treatedGroup[j] = Variable.Bernoulli(probIfTreated).ForEach(j);
   9: }

Maybe I’m missing something, but that *looks* dangerously like there is something static that’s tracking that I have created an If-block, and the If-block is in effect until it is disposed.  Intuitive?  No.  Thread-safe?  Almost certainly not (not that you should be doing much threading with an API like this today, but in 5 years when we’re all running 72 core machines, maybe). 

Overall, the API just doesn’t feel very clean or C#-like to me.  I can’t help but think that the exact same logic could be expressed in a cleaner, simpler way.  Someone that is not very comfortable with probabilities and statistics is not going to be able to use this API in its current form.  If this is going to be portrayed as a toolkit for machine learning, I strongly recommend that the authors look at Weka and try to come up with a similarly intuitive and general API.

In its current state, this is not a toolkit for machine learning, but it is an interesting project.  I’m very glad to see that the trend of cool things coming out of Microsoft Research is continuing.  Just don’t download it looking for the C# version of Weka.

Tags:

Mar 19 2009

A summary of text classification/document categorization

Category: MachineLearningMatt @ 08:19

I’m currently working on a group project for school in the domain of document categorization, but a large portion of my group knows very little about the area.  So, I decided I would do a write-up that explains the domain, the basics of common approaches, and things of that nature.  What follows is meant to be a high-level crash-course in document categorization, but it is written by someone who is not an expert in the domain.  If you see anything wrong with anything in this post, please let me know.

Document Categorization

Document categorization deals with the assignment of documents to one or more categories.  For the purposes of this article, I’m going to deal only with document categorization using supervised machine learning techniques, but there are other approaches.  Unsupervised learning techniques, such as clustering, and semi-supervised learning techniques are also used, but they’re beyond the scope of this write-up.

Supervised machine learning involves learning a function from input data that is labeled with the correct output.  In the case of document categorization, the training data typically consists of documents that have been labeled with the correct category or categories.  A supervised machine learner will analyze the training documents and attempt to learn some function f(x) (called a classifier) that maps new documents to the correct categories.  If this sounds difficult, that’s because it is.  First, you need training data.  Next, you need to come up with some way to represent the documents to the machine learner. Finally, you have to choose a specific machine learner to use.

Training Data

How many documents do you need to train the machine learner?  The more, the better.  In pure supervised learning approaches, that typically means manually identifying training documents and painstakingly labeling each of them with the appropriate categories.  This can be a very expensive and difficult process.  There has been a lot of work, especially in the area of semi-supervised learning, on reducing the burden of acquiring training data.  Those techniques are beyond the scope of this article.

Representing Documents

Assuming you have training data, what next?  The documents need to be processed into some form that’s suitable for machine learning.  Unsurprisingly, you can’t just feed in a massive blob of text and expect magic; you have to do a little more work than that.  The most common representation is the document vector space model.  Each document is tokenized and broken into its constituent terms.  Often stop lists are employed that filter out common, useless terms such as ‘the’.  Other preprocessing steps, such as stemming and normalization, are also common.  The filtered and processed terms are then used to create a vector for the document.  There are a lot of variants, but most are derived from simple term frequency vectors.  In a term frequency vector, the dimensions of a vector correspond to terms, and the values of the dimensions correspond to the number of occurrences of the term in the document.  Take this is an example:

Document: The brown fox jumped over the brown cow.

brown fox jump over cow green red blue
2 1 1 1 1 0 0 0 0

In this example, ‘the’ was filtered due to being a stop word, ‘jumped’ was normalized via stemming, ‘brown’ has a score of 2 because it appeared twice in the document, and the terms that didn’t appear in the document (green, red, blue, etc) all have a score of 0. 

Vectors are N dimensional, where N is the number of distinct terms across all the training documents.  As you can imagine, N is a really big number in most domains.  Typically, vectors are represented using a sparse data structure since individual vectors will consist mostly of zeroes. 

In addition to filtering stop words, a lot of work has been done on feature selection, which deals with deciding which terms (and how many) are actually useful for a given categorization task.  Feature selection typically involves statistical analysis or domain knowledge to identify which terms are important.  Good results have been achieved in many domains by using only a very small subset of the available terms (think tens instead of hundreds).

Choosing a Machine Learner

At this point, the training data has been converted to some form that is useful to a machine learner. Now its time to choose which learner to use.  Classification is a huge area of machine learning, so you have a lot of options.  Popular choices in document categorization are artificial neural networks, naive bayes, and support vector machines.  Support vector machines have been heavily used in recent years due to their ability to work well with high dimensional, noisy data.  A popular open-source support vector machine package is libSVM.  Unlike artificial neural networks, which require extensive manual setup in the form of network topology, support vector machines have few parameters that require adjustment.  Often the default settings for libSVM will achieve good results.

Evaluating Performance

Evaluating the performance of a trained document classifier can be difficult. In document categorization, what is typically desired is a estimate of how well the classifier will perform on new documents that were not part of the training set.  Evaluating performance on the same documents that were used to train the classifier will lead to an overly optimistic estimate.  Instead, holdout, cross validation, and bootstrapping can be used.  10-fold cross validation has been shown to have low variance and bias, but it can be computationally expensive.  This is especially true in document categorization due to the high dimensionality and number of documents involved.

Common measures of performance in document categorization are:

  • Accuracy – The number of documents correctly classified divided by the total number of documents.
  • Precision – The number of documents correctly assigned to the category divided by the total number of  documents assigned to the category.
  • Recall – The number of documents that were correctly assigned to a category divided by the total number of documents that actual belong to the category.
  • F-Measure – A combination of precision and recall (typically it is the harmonic mean of the two).

Though accuracy is simple to calculate, it is not a very useful metric for document categorization. Assume a simple training set that consists of 100 documents, 10 of which belong to a category.  A naive classifier can achieve 90% accuracy by simply classifying each document as ‘false’ for membership in the category.  Precision and recall are better measures.  The two are often inversely related.  Improving the classifier’s precision will lower its recall.  When a single number measurement is desired, F-Measure can be used.

Summary

Text categorization is a very broad (and very active) area of research.  This summary should help point you in the right direction, but it should not be considered exhaustive.  If there is interest in future posts on the topic, let me know, and I’ll dive deeper into some of the issues in text categorization.

Tags:

Nov 3 2008

Bridging the Java-.NET Gap: foreach-ing an Enumeration

Category: .NET | MachineLearningMatt @ 10:08

My fun with IKVM.NET continues this week as I utilize Weka from .NET (just a fun note, but the .NET-compiled version is insanely faster than the Java version doing the exact same thing; suck on that, Java fans!).  For the most part, everything has been swell.  The hardest part is trying to decipher research methodologies to replicate the systems they describe.  Still, I've run into a few snags when working with the .NET versions of Weka, the first of which is that you can't foreach over a Java Enumeration.  For example, consider the following:

   1: //THE FOLLOWING DOESN'T COMPILE!
   2: //foreach (weka.core.Instance instance in instances.enumerateInstances())
   3: //{
   4: //    //TODO: Operate on instance here.
   5: //}
   6:  
   7: Enumeration enumerator = instances.enumerateInstances();
   8: while (enumerator.hasMoreElements())
   9: {
  10:     weka.core.Instance instance = (weka.core.Instance)enumerator.nextElement();
  11:  
  12:     //TODO: Operate on instance here
  13: }

While it isn't a huge difference, being able to use foreach would obviously be simpler than having to create an instance of Enumeration, then use it to step through the items.  Fortunately, it's quite easy to generically bridge the gap so that any Enumeration can be enumerated by simply calling an extension method, like so:

   1: foreach (weka.core.Instance instance in instances.enumerateInstances().ToEnumerable())
   2: {
   3:     //TODO: Operate on instance
   4: }

How does this work?  First, we need to apply the adapter pattern to convert a Java Enumeration into a .NET IEnumerator.  Here's our adapter:

   1: /// <summary>
   2: /// Provides an adapter that can convert a Java
   3: /// Enumeration class into something that implements
   4: /// IEnumerator.  
   5: /// </summary>
   6: public class EnumeratorEnumerationAdapter : IEnumerator
   7: {
   8:     #region Private Fields
   9:  
  10:     /// <summary>
  11:     /// The class being adapted.
  12:     /// </summary>
  13:     private Enumeration mEnumeration;
  14:  
  15:     /// <summary>
  16:     /// The current object.
  17:     /// </summary>
  18:     private object mCurrent;
  19:  
  20:     #endregion
  21:  
  22:     #region Implementation of IEnumerator
  23:  
  24:     /// <summary>
  25:     /// Advances the enumerator to the next element of the collection.
  26:     /// </summary>
  27:     /// <returns>
  28:     /// true if the enumerator was successfully advanced to the next element; false if the enumerator has passed the end of the collection.
  29:     /// </returns>
  30:     /// <exception cref="T:System.InvalidOperationException">The collection was modified after the enumerator was created. </exception><filterpriority>2</filterpriority>
  31:     public bool MoveNext()
  32:     {
  33:         if (!mEnumeration.hasMoreElements())
  34:         {
  35:             return false;
  36:         }
  37:  
  38:         mCurrent = mEnumeration.nextElement();
  39:         return true;
  40:     }
  41:  
  42:     /// <summary>
  43:     /// Sets the enumerator to its initial position, which is before the first element in the collection.
  44:     /// </summary>
  45:     /// <exception cref="T:System.InvalidOperationException">The collection was modified after the enumerator was created. </exception><filterpriority>2</filterpriority>
  46:     public void Reset()
  47:     {
  48:         throw new NotSupportedException();
  49:     }
  50:  
  51:     /// <summary>
  52:     /// Gets the current element in the collection.
  53:     /// </summary>
  54:     /// <returns>
  55:     /// The current element in the collection.
  56:     /// </returns>
  57:     /// <exception cref="T:System.InvalidOperationException">The enumerator is positioned before the first element of the collection or after the last element.-or- The collection was modified after the enumerator was created.</exception><filterpriority>2</filterpriority>
  58:     public object Current
  59:     {
  60:         get
  61:         {
  62:             return mCurrent;
  63:         }
  64:     }
  65:  
  66:     #endregion
  67:  
  68:     #region Public Constructors
  69:  
  70:     /// <summary>
  71:     /// Creates an adapter for the specified enumeration.
  72:     /// </summary>
  73:     /// <param name="enumeration"></param>
  74:     public EnumeratorEnumerationAdapter(Enumeration enumeration)
  75:     {
  76:         mEnumeration = enumeration;
  77:         mCurrent = null;
  78:     }
  79:  
  80:     #endregion
  81: }

Next, we just need to write the extension method that utilizes our adapter to create an IEnumerable:

   1: /// <summary>
   2: /// Contains extension methods to simplify working with 
   3: /// <see cref="Enumeration"/> objects.
   4: /// </summary>
   5: public static class EnumerationExtensions
   6: {
   7:     /// <summary>
   8:     /// Creates a <see cref="IEnumerable"/> wrapper
   9:     /// around a <see cref="Enumeration"/>.
  10:     /// </summary>
  11:     /// <param name="enumeration"></param>
  12:     /// <returns></returns>
  13:     public static IEnumerable ToEnumerable(this Enumeration enumeration)
  14:     {
  15:         EnumeratorEnumerationAdapter adapter = new EnumeratorEnumerationAdapter(enumeration);
  16:  
  17:         while (adapter.MoveNext())
  18:         {
  19:             yield return adapter.Current;
  20:         }
  21:     }
  22: }

And like magic, you can now foreach over any Java Enumeration just like it was a .NET IEnumerable implementor.  It'd be nice if this were baked in to IKVM.NET, but for now, this simple "hack" will do the trick.

Tags:

Sep 16 2008

Machine Learning: Why you should care

Category: MachineLearningMatt @ 10:11

In the last post, I introduced the topic of machine learning.  In this post, I'll describe an example problem, discuss how you might go about writing code to address the problem, then discuss how you can apply machine learning to the same problem and get the computer to do the heavy lifting for you.  In the end, hopefully you'll at least have a vague idea of how machine learning might be useful.

The problem

You work for a company that makes data entry software for hospital emergency rooms.  One day, your boss comes in and says that you're going to add a new feature to the software: based on patient vitals and stats, it will prioritize patients automatically into one of two categories: critical or not critical.  What do you do?  On his way out the door, your boss leaves you with a database containing about 100,000 records from the local ER.  Each record contains stats on some patient and whether or not the supervising doctor decided that the patient was critically injured.  That's all you have to work with.  Stop and think about how you would try to solve this problem. 

Approach 1: Write code

You decide that you'll write a series of if/else-if statements to solve this problem.  Surely that will work, right?  Well, it turns out that there are about 40 stats (we'll call these attributes) for each patient (and we'll call patients instances), so which attributes do you use in your if/else-if statements?  This is going to get pretty hairy in a hurry.  After checking a few instances, you end up with something that looks like:

   1: if (patient.BP < minBP && patient.Pulse < minPulse)
   2: {
   3:     return true;
   4: }
   5: if (patient.O2 < minO2 || patient.Pulse > maxPulse)
   6: {
   7:     return true;
   8: }
   9: if (patient.Pulse == 0)
  10: {
  11:     return true;
  12: }

So far so good!  Except now you look at the next instances, and suddenly you have a patient that meets the requirements for the first if statement, but they're labeled as not critically injured.  So, you carefully examine the instance, and try to tack another check on an attribute into the if statement, but that causes the statement not to match an instance that it should be matching.  So, now you have to add yet another if statement, and it has to go before the original if statement.

Now, multiple that scenario by about 100,000, and you're finished.  Wasn't that fun and easy?

Approach 2: Encode human knowledge

So approach 1 didn't work out so well.  Instead, you try asking the doctors what criteria they use to decide whether or not someone is critically injured.  You get a short explanation and encode it as a couple of rules, easy enough.  Except that when  you test it on your 100,000 records, it gets the vast majority of them wrong.  When you bring a few examples to the doctor, he says "Oh, yeah, the patient isn't critically injured *UNLESS* this attribute has this value, too, then they're critically injured."  You take the knowledge back, rework your rules to encode this new knowledge, and find that you're still not doing a very good job at re-classifying your 100,000 instances.  After multiple iterations with the doctor, you have a set of rules that seem to work.  You roll it out into production, and immediately the phone starts ringing off the hook.  "Your software says this guy is critically injured, but he's fine!" Even though your rules work very, very well on your 100,000 records, they seem to do very poorly in the real world.

Approach 3: The Machine Learning way

You throw away all your rules, and instead decide to try out these fancy machine learning tools.  You don't really know much about the tool, other than you give it data, tell it what attribute to predict, and it builds a model that you can apply to new instances.  For this problem, you feed it in your 100,000 records and tell it to build a model that can predict the critical/not critical status.  After a few short seconds, it spits out a model that works quite well on the 100,000 instances you trained it with (note that you actually don't want it to be 100% accurate most of the time, more on that in a future post).  You hook the model in to your code; all you have to do is pass it an instance, and it passes back a true/false, much cleaner than 100 if/else-if statements strewn about.  You then roll the code out to the world and wait... the phone rings, and people do complain that it isn't 100% accurate, but the calls are infrequent.  Most of the time when it is making mistakes, it is doing so on patients that are borderline anyway.  All in all, not bad considering you had to write almost no code.

So what happened?

The machine learning approach worked because there were patterns in the data that the computer was able to learn to recognize.  The patterns were too subtle for you to pick up on given the sheer size of the data set and the number of attributes on each case.  Our brains have a hard time working across many dimensions at once. The machine learner is really good at that sort of thing though.  It's able to consider all 100,000 records and all 40 attributes quite easily.  It was able to identify patterns in the training data that you fed it, and it generalized those patterns so that they would be useful in classifying new instances.  That's the magic of machine learning: being able to generalize from observed instances to things that haven't been seen before.

Next time, I'll give some examples of how machine learning is used by tools that you're probably already using.  I may even get in to specific types of machine learning techniques and what they can be used for.

Tags:

Sep 8 2008

What is Machine Learning?

Category: MachineLearningMatt @ 09:21

I'm going to be doing some posts on machine learning, artificial intelligence, and data mining over the coming weeks and months as I try to crank out a thesis.  Since machine learning isn't a topic that a lot of developers are familiar with, I decided it would be best to write up a brief summary of what machine learning encompasses and why you should care.  If you have any questions after reading this, please let me know in the comments.  I need to be very solid at explaining machine learning to an uninformed reader if I'm going to crank out a decent thesis...

Machine learning, at its most basic level, is a field that is concerned with techniques that allow computers to "learn".   It is often confused with artificial intelligence, data mining, and other related fields.  That's because many applications of machine learning (or AI, or data mining) tend to straddle the fence between fields as opposed to being purely in one domain or another.  So, machine learning is any computer system that has the capability to learn.  The formal definition of "learn" is typically something like "a system can learn if it can improve its performance at some task based on experience at performing that task".  Basically, if it can get better at something automagically, it is learning. 

Machine learning is a *huge* field with many diverse applications.  My area of research is focused mainly on two applications of machine learning: classification (a form of supervised learning, which we'll talk about in the future) and clustering (a form of unsupervised learning).  There are many other applications that will probably be discussed in future posts (if I ever get the itch). 

Classification deals with trying to classify some individual with respect to a target class or set of target classes.  For example, I'm working on a medical system that takes in a patient data (such as age, blood pressure, pulse, and respiratory rate) and tries to determine whether or not that person is critically injured.  The system does that by using machine learning.  We feed the system a training set, which contains patient data that has been labeled apriori with the correct class, and the system learns how to classify critically injured patients from non-critically injured ones.  The magic is that the machine is able to apply what it learns to patient data that it hasn't seen before.  If it wasn't able to make this inductive leap, the system would be a look-up table. 

Clustering deals with trying to place individuals together into groups (called clusters).  There are many techniques for doing this, but most are still accomplishing the same task: creating clusters wherein individuals in the same cluster are similar to one another and dissimilar to individuals in other clusters.  Clustering algorithms typically need to know a few things: the number of clusters that exist, and how to calculate the 'distance' between individuals (there are several clustering systems that can determine the number of clusters automatically, but most basic clustering algorithms are not capable of making that determination).  Unlike in classification where the goal is to predict some target classification, clustering doesn't care about class labels.  It looks at all the attributes of the data in the training set and groups them by similarity.  New items can then be clustered to previously built categories as needed.

Summary

So, machine learning deals with anything in which a computer system "learns"; it gets better at some task based on experience.  Machine learning is not a synonym for artificial intelligence, though the two are related.  It is  a very broad field that provides useful tools capable of solving complex problems in elegant ways.

In the next post on this topic, we'll delve in to why you should care about machine learning (here's a hint: you are probably already leveraging it and don't even know it, but if you're not, you're working way too hard!)

Tags: