Risan Bagja

Python Notes from Intro to Machine Learning

I rarely use Python. I only have one repository on GitHub that is written in Python: iris-flower-classifier. And it was written two years ago!

A few days ago I took this free course from Udacity: Intro to Machine Learning. The machine learning code is quite easy to grasp since it simply uses the scikit-learn modules. But most of the supporting Python modules provided by the course were a black box to me. I had no idea how to download a file in Python, or what the difference is between a list, a tuple and a dictionary.

That’s why I decided to read all of the provided Python modules and implement them myself. I ended up refactoring most of the code so it’s easier to understand: github.com/risan/intro-to-machine-learning.

So here are some notes and snippets of Python that I’ve been collecting so far (I’m not even halfway through the course 😝). Also, note that the code here still uses Python version 2.7.


Modules, Classes and Functions

Main Entry File

Suppose our Python project is stored in the /foo/bar directory, and the application has one file that serves as the single entry point. We can name this file __main__.py so we can run the project simply by referencing its directory path:

# Referencing its directory.
$ python /foo/bar

# It's equivalent to this.
$ python /foo/bar/__main__.py

Import Python Module Dynamically

Suppose we would like to import a Python module dynamically based on a variable value. We can achieve this through the __import__ function:

module_name = "numpy"

# __import__ returns the module object, so keep a reference to it.
np = __import__(module_name)
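As a rough sketch, the standard library’s importlib.import_module function does the same job and also returns the module object. The json module name below is only a stand-in picked for this example, so it runs without installing anything:

```python
import importlib

# "json" is only an example; any importable module name would work.
module_name = "json"

# import_module returns the imported module object.
module = importlib.import_module(module_name)

# The returned object is used just like a normally imported module.
encoded = module.dumps({"foo": 1})
print(encoded)
```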

Multiple Returns in Python

In Python, it’s possible for a function or a method to return multiple values. We can do this simply by separating each return value by a comma:

def test():
    return 100, "foo"

someNumber, someString = test()
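Under the hood the comma builds a single tuple, and the assignment unpacks it. Here’s a small sketch of that; the min_and_max function is a made-up name for this example:

```python
def min_and_max(numbers):
    # The comma builds a tuple, so this returns one tuple object.
    return min(numbers), max(numbers)


# Without unpacking, we get the tuple itself.
result = min_and_max([3, 1, 2])
print(type(result).__name__)

# Unpacking assigns each element of the tuple to its own name.
smallest, largest = result
```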

Importing Modules Outside of the Directory

In order to import a module from outside of the current directory, we need to add that module’s directory to the module search path with sys.path.append. Suppose we have the following directory structure:

|--foo
| |-- bar.py
|
|-- tools
| |-- speak_yoda.py

If we want to use the speak_yoda.py module within bar.py, we can do the following:

# /foo/bar.py
import sys

# Use relative path to tools directory.
sys.path.append("../tools")

import speak_yoda

However, this won’t work if we run the bar.py file from outside of its foo directory, because the relative path is resolved against the current working directory:

# It works inside of the /foo directory.
$ cd /foo
$ python bar.py

# But it won't work if the code runs from outside of /foo directory.
$ python foo/bar.py

To solve this problem we can refer to the tools directory using its absolute path.

# /foo/bar.py
import os
import sys

# Get the directory name for this file.
current_dirname = os.path.dirname(os.path.realpath(__file__))

# Use the absolute path to the tools directory
tools_path = os.path.abspath(os.path.join(current_dirname, "../tools"))
sys.path.append(tools_path)

import speak_yoda

Output

It turns out you can’t just print an emoji or any other Unicode character to the console. In Python 2, a source file that contains non-ASCII characters needs an encoding declaration on its first or second line:

# coding: utf8

print("😅")

Pretty Print

We can use the pprint module to pretty-print Python data structures with a configurable indentation:

import pprint
pp = pprint.PrettyPrinter(indent=2)

pp.pprint(people)
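Here’s a self-contained sketch of the same idea. The people dictionary is sample data made up for this example, and the width argument is set low only to force the output onto several lines:

```python
import pprint

# Hypothetical sample data, made up for this example.
people = {
    "john": {"email_address": "john@example.com", "salary": 1000},
    "alice": {"email_address": "alice@example.com", "salary": 2000},
}

pp = pprint.PrettyPrinter(indent=2, width=40)

# pformat returns the formatted string instead of printing it.
formatted = pp.pformat(people)
print(formatted)
```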

Working with Pathname

Read more about pathname manipulations in the os.path documentation.

Get Filename From URL

Suppose the last segment of the URL contains a filename that we would like to download. We can extract this filename with the following code:

import os
from urlparse import urlparse

url = "https://example.com/foo.txt"

url_components = urlparse(url)

filename = os.path.basename(url_components.path) # foo.txt
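The same steps can be wrapped in a small function. This is only a sketch: filename_from_url is a made-up name, and the try/except import is there so it should run on both Python 2.7 and Python 3, where urlparse lives in urllib.parse:

```python
import os

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2


def filename_from_url(url):
    # Only the path component is used, so the query string is ignored.
    return os.path.basename(urlparse(url).path)


print(filename_from_url("https://example.com/files/foo.txt?download=1"))
```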

Check if File Exists

To check whether the given file path exists or not:

import os

is_exists = os.path.isfile("foo.txt")
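Note that isfile only returns True for regular files. For a directory we can use os.path.isdir, or os.path.exists if either kind is fine. A quick sketch using a temporary directory (the paths are created just for this example):

```python
import os
import tempfile

directory_path = tempfile.mkdtemp()
file_path = os.path.join(directory_path, "foo.txt")

# Create an empty file so there is something to check.
open(file_path, "w").close()

file_is_file = os.path.isfile(file_path)
directory_is_file = os.path.isfile(directory_path)
directory_exists = os.path.exists(directory_path)
directory_is_directory = os.path.isdir(directory_path)
```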

Create a Directory if It Does Not Exist

To create a directory only if it does not exist:

import os
import errno

try:
    os.makedirs(directory_path)
except OSError as e:
    if e.errno != errno.EEXIST:
        raise
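To try this out, we can wrap it in a helper and call it twice on the same path. This is a sketch only: ensure_directory is a made-up name, and the target path lives in a temporary directory created for the example:

```python
import errno
import os
import tempfile


def ensure_directory(directory_path):
    # Hypothetical helper: create the directory, ignoring "already exists".
    try:
        os.makedirs(directory_path)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise


base_path = tempfile.mkdtemp()
target_path = os.path.join(base_path, "foo", "bar")

ensure_directory(target_path)
ensure_directory(target_path)  # The second call changes nothing.

print(os.path.isdir(target_path))
```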

Working with Files

Downloading a File

We can use the urlretrieve function from the urllib module to download a file in Python. The first argument is the URL of the file that we would like to download. The second argument is an optional filename used to store the file.

import urllib

urllib.urlretrieve("https://example.com/foo.txt", "foo.txt")

Extracting Tar File

There’s a built-in tarfile module that we can use to work with tar files in Python. To extract a tar.gz file we can use the following code:

import tarfile

# Open the file.
tfile = tarfile.open("foo.tar.gz")

# Extract the file to the given path.
tfile.extractall(path)

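We can pass a mode argument to the open function. By default, the mode is r, which reads the archive with transparent compression. Other modes include r:gz for reading a gzip-compressed archive, w for writing an uncompressed archive, and w:gz for writing a gzip-compressed one.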
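Here’s a small sketch that tries the w:gz mode: it writes a tar.gz archive into a temporary directory and then extracts it again with the default mode. The file names are made up for this example:

```python
import os
import tarfile
import tempfile

work_path = tempfile.mkdtemp()

# Create a small text file to archive (hypothetical file name).
source_path = os.path.join(work_path, "hello.txt")
with open(source_path, "w") as source_file:
    source_file.write("hello")

# "w:gz" opens the archive for gzip-compressed writing.
archive_path = os.path.join(work_path, "foo.tar.gz")
tfile = tarfile.open(archive_path, "w:gz")
tfile.add(source_path, arcname="hello.txt")
tfile.close()

# The default "r" mode detects the compression by itself.
extract_path = os.path.join(work_path, "extracted")
tfile = tarfile.open(archive_path)
tfile.extractall(extract_path)
tfile.close()

print(os.listdir(extract_path))
```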

Working with List

Generate a List of Random Numbers

Use a list comprehension (the for..in syntax inside square brackets) to generate a list of random numbers in a one-liner style.

import random

# Initialize internal state of random generator.
random.seed(42)

# Generate random points.
randomNumbers = [random.random() for i in range(0, 10)]
# [0.6394267984578837, 0.025010755222666936, 0.27502931836911926, ...]

Pair Values from Two Lists

The built-in zip function can pair values from two lists. However, zip returns a list of tuples. To get a list of two-item lists instead, we can combine it with a list comprehension:

coordinates = [[x, y] for x,y in zip([5,10,15], [0,1,0])]
# [[5, 0], [10, 1], [15, 0]]
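For comparison, here’s a sketch of what zip gives us on its own. The list() call is an addition for Python 3, where zip returns an iterator; on Python 2.7 it just copies the list. The variable names are made up for this example:

```python
x_values = [5, 10, 15]
y_values = [0, 1, 0]

# Without the comprehension, each pair is a tuple.
tuple_pairs = list(zip(x_values, y_values))
print(tuple_pairs)

# zip stops at the shortest input, so the extra 20 is dropped.
short_pairs = list(zip([5, 10, 15, 20], y_values))
```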

Splitting a List

We can easily split a list in Python by specifying the starting index and its ending index. Note that the ending index is excluded from the result.

We can also specify a negative index. And also note that both of these indices are optional!

a = [0,1,2,3,4,5]

a[0:3]  # 0,1,2
a[1:3]  # 1,2
a[2:]   # 2,3,4,5
a[:3]   # 0,1,2
a[0:-2] # 0,1,2,3
a[-2:]  # 4,5
a[:]    # 0,1,2,3,4,5
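A few more things we can do with the same syntax, as a rough sketch: split a list in two at a given index, pass an optional third value as the step, and copy a list. The variable names are made up for this example:

```python
a = [0, 1, 2, 3, 4, 5]

# Split the list into two parts at a given index.
split_index = 4
head, tail = a[:split_index], a[split_index:]

# An optional third value sets the step.
every_second = a[::2]

# A slice is a new list, so changing it leaves the original alone.
copy_of_a = a[:]
copy_of_a[0] = 99
```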

Filtering a List In One Line

We can easily filter a list in Python by combining the for..in and the if syntax in a list comprehension:

numbers = range(1,11)

# Filter even numbers only.
[number for number in numbers if number % 2 == 0]
# [2, 4, 6, 8, 10]

Sorting a List in Ascending Order

In Python, we can sort a list in ascending order simply by calling the sort method like so:

people = ["John", "Alice", "Poe"]
people.sort()
print(people) # ["Alice", "John", "Poe"]

Using Filter Function with a List

As its name suggests, we can use the filter function to filter our list:

numbers = range(1, 11)

even_numbers = filter(lambda number: number % 2 == 0, numbers)
# [2, 4, 6, 8, 10]

We can break the above statement into two parts: the lambda function, which returns True for every item we want to keep, and the list to filter.
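Here’s a rough sketch of those two parts, with the lambda swapped for a named function. The is_even name is made up for this example, and the list() call is only there so the result is a list on Python 3 as well:

```python
numbers = range(1, 11)


def is_even(number):
    # Part one: the test that decides whether an item is kept.
    return number % 2 == 0


# Part two: the list to filter. On Python 3 filter returns an
# iterator, so list() turns the result back into a list.
even_numbers = list(filter(is_even, numbers))
print(even_numbers)
```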

Using Reduce with a List of Dictionary

We can use the reduce function to calculate the total of a particular key in a list of dictionaries:

items = [{"value": 10}, {"value": 20}, {"value": 50}]

# Calculate the total of value key.
totalValues = reduce(lambda total, item: total + item["value"], items, 0) # 80

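It can be broken down into four parts: the reduce call itself, the lambda that adds each item’s value to the running total, the list to process, and the initial total of 0.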

We can also use the reduce function to find a single item in the list. Here’s an example that finds the person with the biggest total_payments in a list of people dictionaries.

people = [
    {"name": "John", "total_payments": 100},
    {"name": "Alice", "total_payments": 1000},
    {"name": "Poe", "total_payments": 800}
]

person_biggest_total_payments = reduce(lambda paid_most, person: person if person["total_payments"] > paid_most["total_payments"] else paid_most, people, { "total_payments": 0 })
# {'name': 'Alice', 'total_payments': 1000}
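That one-liner is a bit dense, so here’s a sketch with the lambda pulled out into a named function, next to the built-in max with a key function as an alternative. The pick_bigger name is made up, and the functools import is an addition so the code should also run on Python 3, where reduce is no longer a built-in:

```python
from functools import reduce  # Built in on Python 2, needed on Python 3.

people = [
    {"name": "John", "total_payments": 100},
    {"name": "Alice", "total_payments": 1000},
    {"name": "Poe", "total_payments": 800},
]


def pick_bigger(paid_most, person):
    # Keep whichever of the two has the bigger total_payments.
    if person["total_payments"] > paid_most["total_payments"]:
        return person
    return paid_most


biggest_with_reduce = reduce(pick_bigger, people, {"total_payments": 0})

# max with a key function is another way to get the same item.
biggest_with_max = max(people, key=lambda person: person["total_payments"])
```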

Working with Dictionary

Loop Through Dictionary

We can use the itervalues method to loop through a dictionary:

for person in people.itervalues():
    print(person["email_address"])

We can also use the iteritems method if we want to access the key too:

# Each item is a (key, value) tuple, so we can unpack it in the loop.
for name, person in people.iteritems():
    print(name + ": " + person["email_address"])
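The itervalues and iteritems methods only exist in Python 2. As a sketch that should run on both Python 2.7 and Python 3, we can use items instead. The people dictionary is sample data made up for this example, and sorted is only there to get a predictable order:

```python
# Hypothetical sample data, made up for this example.
people = {
    "john": {"email_address": "john@example.com", "salary": 1000},
    "alice": {"email_address": "alice@example.com", "salary": 2500},
}

lines = []

# Each item unpacks into the key and its value.
for name, person in sorted(people.items()):
    lines.append(name + ": " + person["email_address"])

print("\n".join(lines))
```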

Calculate Total of Particular Dictionary Key

Suppose we would like to calculate the total of the salary key across a people dictionary. We can extract each salary and use the sum function to get the total:

total_salary = sum(person["salary"] for person in people.itervalues())

Working with Numpy

Numpy Create Range of Values with The Given Interval

Use the arange function to create an array of evenly spaced values.

import numpy as np

np.arange(0, 5, 1)
# array([0,1,2,3,4])

np.arange(1, 4, 0.5)
# array([1. , 1.5, 2. , 2.5, 3. , 3.5])

Numpy Create Coordinate Matrices from Coordinate Vectors

We can use the Numpy meshgrid function to make coordinate matrices from one-dimensional coordinate arrays.

import numpy as np

np.meshgrid([1, 2, 3], [0, 7])
# [
#   array([[1,2,3], [1,2,3]]),
#   array([[0,0,0], [7,7,7]])
# ]

Flatten Numpy Array

When we have a multi-dimensional Numpy array, we can easily flatten it with the ravel method:

import numpy as np

arr = np.array([[1,2], [3,4]])
arr.ravel()
# array([1, 2, 3, 4])

Pairing Array Values with Second Axis

We can use Numpy’s c_ object to pair array values with another array that will be its second axis. Read the numpy.c_ documentation.

import numpy as np

x = [1,2]
y = [10,20]

np.c_[x, y]
# array([[1, 10], [2, 20]])

Generate Coordinates Across The Grid

With the Numpy arange, meshgrid, ravel and c_ tools covered, we can easily generate evenly spaced coordinates across the grid, pass them to the classifier, and plot the decision surface.

import numpy as np

# Generate evenly spaced coordinates.
x_points, y_points = np.meshgrid(np.arange(x_min, x_max, step), np.arange(y_min, y_max, step))

# Pair the x and y points.
test_coordinates = np.c_[x_points.ravel(), y_points.ravel()]

Plotting the Data

Plot The Decision Surface

We can pass evenly spaced coordinates across the grid to the classifier to predict the output at each coordinate. We can then use matplotlib.pyplot to plot the decision surface.

import matplotlib.pyplot as plt
import pylab as pl

# Pass coordinates across the grid.
predicted_labels = classifier.predict(test_coordinates)

# Don't forget to reshape the output array dimension.
predicted_labels = predicted_labels.reshape(x_points.shape)

# Set the axes limit.
plt.xlim(x_points.min(), x_points.max())
plt.ylim(y_points.min(), y_points.max())

# Plot the decision boundary with seismic color map.
plt.pcolormesh(x_points, y_points, predicted_labels, cmap = pl.cm.seismic)

The classifier output is a one-dimensional array, so don’t forget to reshape it back into a two-dimensional array before plotting. The cmap is an optional parameter for the color map. Here we use the seismic color map from the pylab module, which runs from blue to red.

Scatter Plot

We need to separate the test points by their label (the speed) so that we can plot them in two different colors.

# Separate fast (label = 0) & slow (label = 1) test points.
grade_fast = [features_test[i][0] for i in range(0, len(features_test)) if labels_test[i] == 0]
bumpy_fast = [features_test[i][1] for i in range(0, len(features_test)) if labels_test[i] == 0]
grade_slow = [features_test[i][0] for i in range(0, len(features_test)) if labels_test[i] == 1]
bumpy_slow = [features_test[i][1] for i in range(0, len(features_test)) if labels_test[i] == 1]

# Plot the test points based on its speed.
plt.scatter(grade_fast, bumpy_fast, color = "b", label = "fast")
plt.scatter(grade_slow, bumpy_slow, color = "r", label = "slow")

# Show the plot legend.
plt.legend()

# Add the axis labels.
plt.xlabel("grade")
plt.ylabel("bumpiness")

# Show the plot.
plt.show()

If we want to save the plot into an image, we can use the savefig method instead:

plt.savefig('scatter_plot.png')

Dealing with Data

Deserializing Python Object

We can use the pickle module to serialize and deserialize Python objects. There’s also cPickle, a faster C implementation. We use both of these modules to deserialize the email text and author list.

import pickle
import cPickle

# Unpickling or deserializing the texts.
# Pickle files are binary, so open them in "rb" mode.
texts_file_handler = open(texts_file, "rb")
texts = cPickle.load(texts_file_handler)
texts_file_handler.close()

# Unpickling or deserializing the authors.
authors_file_handler = open(authors_file, "rb")
authors = pickle.load(authors_file_handler)
authors_file_handler.close()
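To see both directions, here’s a sketch of a full round trip with the pickle module. The authors list and the file name are sample values made up for this example, and the with blocks close the files for us:

```python
import os
import pickle
import tempfile

# Hypothetical sample data standing in for the author list.
authors = [0, 1, 1, 0]

authors_file = os.path.join(tempfile.mkdtemp(), "authors.pkl")

# Serialize: pickle files are binary, so open them with "wb".
with open(authors_file, "wb") as file_handler:
    pickle.dump(authors, file_handler)

# Deserialize: open with "rb" and load the object back.
with open(authors_file, "rb") as file_handler:
    loaded_authors = pickle.load(file_handler)
```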

Split Data for Training and Testing

We can use the built-in train_test_split function from scikit-learn to split the data both for training and testing.

from sklearn.model_selection import train_test_split

features_train, features_test, labels_train, labels_test = train_test_split(texts, authors, test_size = 0.1, random_state = 42)

The test_size argument is the proportion of the data to set aside for testing. In our case, 10% of the data goes to the test set.

Vectorize the Strings

When working with a text document, we need to vectorize the strings into a list of numbers so it’s easier and more efficient to process. We can use the TfidfVectorizer class to vectorize the strings into a matrix of TF-IDF features.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf = True, max_df = 0.5, stop_words = "english")
features_train_transformed = vectorizer.fit_transform(features_train)
features_test_transformed = vectorizer.transform(features_test)

Words with a document frequency higher than max_df are ignored. Stop words are ignored too; these are the most common words in a language (e.g. a, the, has).

Feature Selection

Text can have a lot of features, so it may be slow to compute. We can use scikit-learn’s SelectPercentile class to select only the most important features.

from sklearn.feature_selection import SelectPercentile, f_classif

selector = SelectPercentile(f_classif, percentile = 10)
selector.fit(features_train_transformed, labels_train)
selected_features_train_transformed = selector.transform(features_train_transformed).toarray()
selected_features_test_transformed = selector.transform(features_test_transformed).toarray()

The percentile is the percentage of features to keep, chosen by highest score.