Details on Sklearn Classifiers

Timur
2 min read · Feb 7, 2020

I was working on a downstream task that depended on the order of a classifier's class labels, so I had to look into the interfaces and implementation details. Here are some details you might not be aware of. In this post, I use MLPClassifier as an example.

Non-integer Labels Can Be Used Directly

In general, the classification labels fed to neural nets should be integers. With Sklearn, LabelEncoder can easily encode target values y as values between 0 and n_classes-1.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
targets = ["red", "yellow", "blue", "red", "red"]
le.fit(targets)

Then, you can use the label encoder to transform targets easily.

le.transform(targets)
>> array([1, 2, 0, 1, 1])

And the original class labels can be accessed via

le.classes_
>> array(['blue', 'red', 'yellow'], dtype='<U6')
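
To go back from integer codes to the original labels, LabelEncoder also provides inverse_transform. The snippet below is a small sketch of the round trip using the same targets as above; the variable names encoded and decoded are only illustrative.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
targets = ["red", "yellow", "blue", "red", "red"]

# fit_transform learns the classes and encodes in one step
encoded = le.fit_transform(targets)

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(encoded)
print(list(encoded), list(decoded))
```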

However, you can fit a classifier to non-integer labels directly without preprocessing labels.

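import random
import numpy as np
import torch
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(256,), activation='relu')
# Random string labels; features are centered at +5 or -5 depending on the label
labels = [random.choice(["ayes", "nooo00"]) for _ in range(2500)]
x = []
for label in labels:
    if label == "ayes":
        x.append(np.asarray(torch.randn(128, dtype=torch.float) + 5))
    else:
        x.append(np.asarray(torch.randn(128, dtype=torch.float) - 5))
# Fit directly on the string labels, no LabelEncoder needed
clf.fit(np.asarray(x), np.asarray(labels))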
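
A classifier fit this way also predicts in terms of the original string labels. Here is a minimal sketch of that, using a smaller NumPy-only dataset (200 samples, 8 features) and a fixed seed. These sizes and the random_state are my own assumptions to keep the example quick, not settings from the experiment above.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
labels = rng.choice(["ayes", "nooo00"], size=200)
# Shift each 8-dimensional sample by +5 or -5 depending on its label
offsets = np.where(labels == "ayes", 5.0, -5.0)[:, None]
X = rng.standard_normal((200, 8)) + offsets

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
clf.fit(X, labels)

# predict returns entries of clf.classes_, i.e. strings, not integer codes
preds = clf.predict(X[:5])
print(preds)
```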

Class Label Access and Order

If the classifier is fit to non-integer labels directly, the original labels are mapped to internal class indices before training. The class labels can be accessed via

clf.classes_
>> array(['ayes', 'nooo00'], dtype='<U6')

The order of labels in the classes_ attribute is the order used for training the classifier. In the case above, ‘ayes’ is class 0 and ‘nooo00’ is class 1.
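
This order matters most when reading predict_proba, whose columns follow classes_. The sketch below looks up a column by label rather than hard-coding its position. The small dataset, the seed, and the names col, p_nooo00 and manual are assumptions made for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
labels = rng.choice(["ayes", "nooo00"], size=200)
X = rng.standard_normal((200, 8)) + np.where(labels == "ayes", 5.0, -5.0)[:, None]

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
clf.fit(X, labels)

proba = clf.predict_proba(X[:5])  # shape (5, n_classes), columns follow clf.classes_

# Look up the column for a label instead of assuming its position
col = list(clf.classes_).index("nooo00")
p_nooo00 = proba[:, col]

# Indexing classes_ with the argmax column recovers the predicted labels
manual = clf.classes_[np.argmax(proba, axis=1)]
print(clf.classes_, manual)
```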

The mapped labels are sorted (alphabetically, for strings) rather than kept in order of occurrence. This is the same order you get if you use LabelEncoder to transform the labels yourself.

targets = ["red", "yellow", "blue", "red", "red"]
le.fit(targets)
le.classes_
>> array(['blue', 'red', 'yellow'], dtype='<U6')

In the example above, even though the first label that LabelEncoder encounters is “red”, it is assigned index 1 instead of 0, because “blue” sorts before it.
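
A quick way to check this is to compare classes_ with the sorted unique labels and with the order of first appearance. The sketch below does that and builds an explicit label-to-index lookup for downstream code. The names first_seen and label_to_index are hypothetical, not part of Sklearn.

```python
import numpy as np
from sklearn import preprocessing

targets = ["red", "yellow", "blue", "red", "red"]
le = preprocessing.LabelEncoder().fit(targets)

# np.unique returns the sorted unique labels
sorted_unique = np.unique(targets)

# Order of first appearance, for comparison (dicts preserve insertion order)
first_seen = list(dict.fromkeys(targets))

# Explicit lookup so downstream code never has to guess an index
label_to_index = {str(label): i for i, label in enumerate(le.classes_)}
print(list(le.classes_), first_seen, label_to_index)
```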

Timur

Phd in Computer Science and procrastination. Research scientist @Meta.