Clean Code

AI Okinawa

Jakub “Kuba” Kolodziejczyk

Kuba Kolodziejczyk

出身: ポーランド
大学: ロンドン大学, 大阪大学

過去
Nanyang Technological University
OIST
レキサス

現在
AI Okinawa - 代表
LiLz株式会社 - CTO
琉球大学 - 非常勤講師

Don’t assume textbooks and tutorials show a good way to write code

int marks = 100, *p1, *p2;
p1 = &marks;
p2 = p1;

int marks = 100 ;
int *first_marks_pointer ;
int *second_marks_pointer ;

first_marks_pointer = &marks;
second_marks_pointer = first_marks_pointer;

Naming rules I
Use natural, descriptive language

for b in bb:
    for j in c:
        fff(b, j)

for box in boxes:
    for color in colors:
        draw(box, color)

Naming rules II
Nouns for objects, verbs for actions, plurals for plurals

nouns for objects
verbs for actions
plurals for plurals

my_house = House()
value = my_house.get_value()

rooms = ["kitchen", "garage"]
rooms.append("living room")

Naming rules III
Don’t use abbreviations

cat = len(catDic.keys())
imgSz = 1280, 1280

genFctr = DataBtchsGenFctr(datDir, cat, imgSz)

categories_count = len(categories_dictionary.keys())
image_size = 1280, 1280

generator_factory = DataBatchesGeneratorFactory(data_directory, categories_count, image_size)

Use names that correctly indicate responsibility

def get_volume(box):

    first_distance = get_distance(box[0], box[1])
    second_distance = get_distance(box[1], box[2])
    third_distance = get_distance(box[2], box[0])

    return first_distance + second_distance + third_distance

area = get_volume(box)

def get_circumference(triangle):

    first_distance = get_distance(triangle[0], triangle[1])
    second_distance = get_distance(triangle[1], triangle[2])
    third_distance = get_distance(triangle[2], triangle[0])

    return first_distance + second_distance + third_distance

circumference = get_circumference(triangle)

Unit tests

Why we ♥ tests:

Tests check if your new, shiny function is correct
Tests check if your refactorings won't break anything
Tests help you write good code, because bad code is hard to test
Tests form an executable documentation

def get_max(values):
    """
    A simple function to get maximum value from a list of values
    """

    max_value = values[0]

    for value in values[1:]:
        if value > max_value:
            max_value = value

    return max_value

# Execute with py.test or similar
def test_get_max():

    values = [2, 5, 10, 1]

    expected = 10
    actual = get_max(values)

    assert expected == actual

Harder example

# Are you sure you can write this function without any bugs?
def get_polygon_area(polygon):
    
    # Loads of math
    # ...
    return area

With unit tests you will know if your code is right or wrong!

def test_get_polygon_area_square():

        polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]

        expected == 100
        actual = get_polygon_area(polygon)

        assert expected == actual


def test_get_polygon_area_rectangle():

    polygon = [(0, 0), (20, 0), (20, 10), (0, 10)]

    expected == 200
    actual = get_polygon_area(polygon)

    assert expected == get_polygon_area(polygon)


def test_get_polygon_area_triangle():

    polygon = [(0, 0), (10, 0), (5, 5)]

    expected == 25
    actual = get_polygon_area(polygon)

    assert expected == get_polygon_area(polygon)

But writing tests slows me down!

Consider this:

If you don't know your code is correct, how can you tell you're done? If you don't know when you're done, how can you measure if tests slow you down?
Is it better to write correct function in 30min, or wrong one in 10min and next week have entire team blocked for 4h while you try to figure out what needs fixing and how to fix it?
Do you prefer to find your bugs yourself, or would you rather have an angry customer bring them up to your boss?

Testing makes your code better

Say you have to write functionality that counts number of words in some text files, and you come up following code:

def get_words_count(path):

    with open(path) as file:

        text = file.read()

        matches = re.findall(pattern=r"\w+", string=text)
        return len(matches)

This code looks ok on cursory inspection, but what would happen if you were to test it?
You would have to do something like this:

def test_get_words_count_simple_input():

    with tempfile.NamedTemporaryFile(mode="w") as file:

        file.write("some simple text")
        file.flush()

        expected = 3
        actual = get_words_count(file.name)

    assert expected == actual


def test_get_words_count_input_contains_special_characters():

    with tempfile.NamedTemporaryFile(mode="w") as file:

        file.write("Some !!! special ??? characters --- here")
        file.flush()

        expected = 4
        actual = get_words_count(file.name)

    assert expected == actual

But why is that necessary? After all counting words shouldn't require you to create a file, right?
Creating a file has nothing to do with functionality you want to test, so why is it necessary to perform testing?

If testing the functionality is not convenient, then that functionality likely isn't written in a flexible way and won't be easy to use under different circumstances.
Say what would happen if I wanted to count number of words on a webpage available online? I would have to download to local drive first, and then take care of deleting it.

All of this suggests that there are some problems with our function. Now let's think what would be easy to test?
How about

def get_words_count(text):

    matches = re.findall(pattern=r"\w+", string=text)
    return len(matches)

Now testing is simple too!

def test_get_words_count_simple_input():

    expected = 3
    actual = get_words_count("some simple text")

    assert expected == actual


def test_get_words_count_input_contains_special_characters():

    expected = 4
    actual = get_words_count("Some !!! special ??? characters --- here")

    assert expected == actual

And with this setup it will be easy to use our function with a wide range of input sources.

TL;DR: Testing makes your code better!

Command and query separation

Command: changes state of resource
Query: retrieves information

Advantages of queries:

Calls that don't change system are immediately obvious in code
Can be called in any order
Simple to test

def get_total_weight(products):

    total_weight = 0

    for product in products:

        if product.weight is None:
            product.weight = 0

        total_weight += product.weight

    return total_weight

products = [Product("brick", 1), Product("post_stamp"), Product("ball", 0.5)]
        
products_with_no_weight_set = get_products_with_no_weight_set(products)
print(products_with_no_weight_set)
# prints [Product("post_stamp")]

products = [Product("brick", 1), Product("post_stamp"), Product("ball", 0.5)]

total_weight = get_total_weight(products)
print(total_weight)
# prints 1.5

products_with_no_weight_set = get_products_with_no_weight_set(products)
print(products_with_no_weight_set)
# returns []

Better solution

def get_total_weight(products):

    total_weight = 0

    for product in products:

        if product.weight is not None:
            total_weight += product.weight

    return total_weight

products = [Product("brick", 1), Product("post_stamp"), Product("ball", 0.5)]

total_weight = get_total_weight(products)
print(total_weight)
# prints 1.5

products_with_no_weight_set = get_products_with_no_weight_set(products)
print(products_with_no_weight_set)
# returns [Product("post_stamp")]

Now results are as expected.
Ok, but what about commands that require asking questions? E.g. what if I want to delete files older than some date?

# Command can make queries - no problem
def delete_files_older_than(directory, date):

    # Query
    old_files = get_files_older_than(directory, date)

    # Command
    delete_files(old_files)

So calling a query from inside a command is fine.

What if I want to call a command from inside a query?
That would mean query changes the system - so it's not a query ^^

Ok, but what if I want to delete old files and return back how many files are left?
It's best to split this logic into two functions

get_files_older_than(directory, date)
get_files_count(directory)

Single responsibility principle I - classes
Class should perform only one type of tasks

class AreaCalculator:

    def get_triangle_area(triangle):
            # ...

    def get_rectangle_area(rectangle):
            # ...

    def get_polygon_area(polygon):
            # ...

    def split_string(text):
            # ...

class AreaCalculator:

    def get_triangle_area(triangle):
        # ...

    def get_rectangle_area(rectangle):
        # ...

    def get_polygon_area(polygon):
        # ...

class StringFormatter:

    def split_string(text):
        # ...

Single responsibility principle II - functions I
Function should return only one type of objects, and preferably only one object

def get_image_info(image):

    # Loads of complicated computations, 10 lines of code
    size = ....

    # Loads of complicated computations, 100 lines of code
    histogram = ...

    # Loads of complicated computations, 200 lines of code
    faces_bounding_boxes = ...

    # Loads of complicated computations, 200 lines of code
    cats_bounding_boxes = ...

    return size, histogram, faces_bounding_boxes, cats_bounding_boxes

def get_image_size(image):

    # Loads of complicated computations, 10 lines of code
    size = ....
    return size

def get_image_histogram(image):

    # Loads of complicated computations, 100 lines of code
    histogram = ...
    return histogram

def get_faces_bounding_boxes(image):

    # Loads of complicated computations, 200 lines of code
    faces_bounding_boxes = ...
    return faces_bounding_boxes

def get_cats_bounding_boxes(image):

    # Loads of complicated computations, 200 lines of code
    cats_bounding_boxes = ...
    return cats_bounding_boxes

Single responsibility principle III - functions II
Move code that can be encapsulated to its own function

def get_car_value(model, year, mileage, accidents_history):

    base_model_value = get_base_model_value(model)
    age_discount = get_age_discount(get_current_year() - year)

    value_loss_due_to_mileage = get_value_loss_due_to_mileage(mileage)
    value_loss_due_to_accidents = get_value_loss_due_to_accidents(accidents_history)

    total_value = (age_discount * base_model_value) - value_loss_due_to_mileage - value_loss_due_to_accidents

    min_value = 0
    max_value = 100000

    if total_value < min_value:
        total_value = min_value

    if total_value > max_value:
        total_value = max_value


    return total_value

def get_car_value(model, year, mileage, accidents_history):

    base_model_value = get_base_model_value(model)
    age_discount = get_age_discount(get_current_year() - year)

    value_loss_due_to_mileage = get_value_loss_due_to_mileage(mileage)
    value_loss_due_to_accidents = get_value_loss_due_to_accidents(accidents_history)

    total_value = (age_discount * base_model_value) - \
            value_loss_due_to_mileage - value_loss_due_to_accidents

    clipped_value = clip(total_value, min_value=0, max_value=100000)
    return clipped_value


def clip(value, min_value, max_value):

    if value > max_value:
         return max_value

    if value < min_value:
         return min_value

    return value

Don’t repeat yourself (DRY) I - identical code

class BooksShoppingWidget:

    def __init__(books):

        # Display top navigation widget
        ...

        # Build query widget, 20 lines of code
        ...

        # Build books list
        for book in books:

            # Display book cover
            ...

            # Display title
            ...

            # Get first 100 characters of book description
            description = []

            with open(book.url) as book_data:
                while len(description) < 100:

                    line = book_data.read_line()
                    description.extend(line)

            # Display description
            ... = description[:100]


        # build recommendations widget, 20 lines of code
        ...


class RandomBookDescriptionWidget:

    def __init__(books):

        index = get_random_integer(0, len(books))
        book = books[index]

        # Get first 100 characters of book description
        description = []

        with open(book.url) as book_data:
            while len(description) < 100:

                line = book_data.read_line()
                description.extend(line)

        # Display description
        ... = description[:100]

def get_book_description(book):

    # Get first 100 characters of book description
    description = []

    with open(book.url) as book_data:
        while len(description) < 100:

            line = book_data.read_line()
            description.extend(line)

    return description[:100]


class BooksShoppingWidget:

    def __init__(books):

        # Display top navigation widget
        ...

        # Build query widget, 20 lines of code
        ...

        # Build books list
        for book in books:

            # Display book cover
            ...

            # Display title
            ...

            # Display description
            ... = get_book_description(book)

        # build recommendations widget, 20 lines of code
        ...


class RandomBookDescriptionWidget:

    def __init__(books):

        index = get_random_integer(0, len(books))
        book = books[index]

        # Display description
        ... = get_book_description(book)

Don’t repeat yourself (DRY) II - nearly identical code

class BooksShoppingWidget:

    def __init__(books):

        # Display top navigation widget
        ...

        # Build query widget, 20 lines of code
        ...

        # Build books list
        for book in books:

            # Display title
            ...

            # Display book cover
            ...

            # Get first 100 characters of book description
            description = []

            with open(book.url) as book_data:

                while len(description) < 100:

                    line = book_data.read_line()
                    description.extend(line)

        # Display description
            ... = description[:100]


        # build recommendations widget, 20 lines of code
        ...


class RandomBookDescriptionWidget:

    def __init__(books):

        index = get_random_integer(0, len(books))
        book = books[index]

        # Get first 200 characters of book description
        description = []

        with open(book.url) as book_data:
            while len(description) < 200:

                line = book_data.read_line()
                description.extend(line)

        # Display description
        ... = description[:200]

def get_book_description(book, characters_count):

    description = []

    with open(book.url) as book_data:
        while len(description) < characters_count:

            line = book_data.read_line()
            description.extend(line)

    return description[:characters_count]


class BooksShoppingWidget:

    def __init__(books):

        # Display top navigation widget
        ...

        # Build query widget, 20 lines of code
        ...

        # Build books list
        for book in books:

            # Display title
            ...

            # Display book cover
            ...

            # Display description
            ... = get_book_description(book, 100)


        # build recommendations widget, 20 lines of code
        ...


class RandomBookDescriptionWidget:

    def __init__(books):

        index = get_random_integer(0, len(books))
        book = books[index]

        # Display description
        ... = get_book_description(book, 200)

Explicit is always better than implicit I - using globals

Say you see code like this:

# script.py
text = "I like cats. I like dogs. Dogs don't like cats. I wish they would like each other."

frequent_words = get_frequent_words(text)
print(frequent_words) # prints ['I', 'like']

But how does this code decide which words are frequent? We have to look inside the function to find out:

import config

def get_frequent_words(text):

    words = re.findall(r"\w+", text)
    words_counter = collections.Counter(words)
    
    frequent_words = [word for word, count in words_counter.items() 
                      if config.frequent_words_threshold <= count]

    return frequent_words

# Inside config module
frequent_words_threshold = 3

You have to read content of get_frequent_words(...) and of config module to understand what’s going on.
That's an implicit code - it hides some of its inputs from the reader, so it is difficult to understand how it works.

change globals into function arguments
move configuration values into script entry point

# script.py
import config

text = "I like cats. I like dogs. Dogs don't like cats. I wish they would like each other."

frequent_words = get_frequent_words(text, config.frequent_words_threshold)
print(frequent_words) # prints ['I', 'like']

Explicit is always better than implicit II - assuming your clients

def get_house_value(house_data):

    if house_data.last_evaluation_date == time.today:
        return -1

    # Look at size, number of rooms, neighbourhood
    # and other factors, expensive computation
    value = ...

    return value

Why -1 is returned when last_evaluation_date == time.today?

class PropertiesManager:

    # Other code
    # ...

    def get_house_data(id):

        house_data = database.get_house_data(id)
        value = get_house_value(house_data)

        if value == -1:
           value = self.cached_houses_values[id]

        house_data.value = value
        return house_data

Ok, and what would happen if someone who doesn’t know that -1 means cached results should be used calls the function?
Solution: move cache condition check to PropertiesManager

def get_house_value(house_data):

    # Look at size, number of rooms, neighbourhood
    # and other factors, expensive computation
    value = ...
    return value


class PropertiesManager:

    # Other code
    # ...

    def get_house_data(id):

        house_data = database.get_house_data(id)

        if house_data.last_evaluation_date != time.today:
            self.cached_houses_values[id] = get_house_value(house_data)

        house_data.value = self.cached_houses_values[id]
        return house_data

Error codes vs exceptions

Task: print a document

# Using error codes
def print(document, print_settings):

    # Connect to printer
    printer_connection, error = get_printer_connection()

    if error != 0:
        return error

    # Get documents data stream
    document_data_stream, error = document.get_data_stream()

    if error != 0:
        return error

    error = printer_connection.print(document_data_stream, print_settings)

    # 0 if succeeded, else non-zero
    return error


# In client:
error = print(document, print_settings)

if error != 0:
    # Do something

# With exceptions:
def print(document, print_settings):

    # Connect to printer
    printer_connection = get_printer_connection()

    # Get documents data stream
    document_data_stream = document.get_data_stream()

    # Print
    printer_connection.print(document_data_stream, print_settings)


# In client
try:
    print(document, print_settings)
except:
    # Do something

Splitting complex classes into components

Suppose you have to write an app with a reasonably complex GUI

Here's a naive way to go about it

class App:
        
    def __init__(self):

        # Create buttons for selecting time ranges used by expenses graphics
        time_ranges_buttons_container = Container(...)

        self.all_button = Button("All", ...)
        time_ranges_buttons_container.add(time_ranges_buttons_container)

        self.weekly_button = Button("Weekly", ...)
        time_ranges_buttons_container.add(time_ranges_buttons_container)

        self.monthly_button = Button("Monthly", ...)
        time_ranges_buttons_container.add(time_ranges_buttons_container)

        self.yearly_button = Button("Yearly", ...)
        time_ranges_buttons_container.add(time_ranges_buttons_container)

        # 20 lines of UI code around buttons

        self.layout.append(time_ranges_buttons_container)

        # 500 more lines of code building other widgets

Every single button, label and other simple widget on the screen is defined inside App class. This makes App class very large and complex - hence hard to work with and easy to break.

Instead of writing everything inside App class, search for components that could encapsulate simple elements.

class App:
    
    def __init__(self):

        self.time_ranges_container = TimeRangesContainer(...)
        self.layout.append(self.time_ranges_container)

        # Access buttons with self.time_ranges_container.weekly_button, etc

        # 50 lines or less of code building other widgets, since we encapsulated them in components as well


class TimeRangesContainer(Container):

    def __init__(...):

        self.all_button = Button("All", ...)
        self.add(self.all_button)

        self.weekly_button = Button("Weekly", ...)
        self.add(self.weekly_button)

        self.monthly_button = Button("Monthly", ...)
        self.add(self.monthly_button)

        self.yearly_button = Button("Yearly", ...)
        self.add(self.yearly_button)

        # 20 lines of UI code around buttons

Anonymous functions and their drawbacks

Without anonymous function:

function on_button_clicked() {
    // Do something
}

$(".btn_1").click(on_button_clicked);

With anonymous function:

$(".btn_1").click(function() {
  // Do something
});

The problem is you can't test that. So how do you know if code below is correct?

$(".phone_number").focusout(function() {
  // get text
  var phone_number = $(".phone_number").val() ;

  // format it in xxx-yyyy-zzz form
  // ...

  // write it back to phone_number_form
  $(".phone_number").val(formatted_phone_number) ;
});

If instead you write

function get_formatted_phone_number(phone_number) {

    // format it in xxx-yyyy-zzz form
    // ....
    return formatted_phone_number ;
}


$(".phone_number").focusout(function() {

    var phone_number = $(".phone_number").val() ;
    var formatted_phone_number = get_formatted_phone_number(phone_number) ;
    $(".phone_number").val(formatted_phone_number) ;
});

you can test that your logic is correct

test('phone number is correctly formatted', () => {

    expected = "123-456-789"
    actual = get_formatted_phone_number("123456789")

    expect(actual).toBe(expected)
});

Use sets for filtering unique elements

import time
import numpy as np


def get_unique_elements(data):

    unique_elements = []

    for element in data:

        element_already_added = False

        for unique_element in unique_elements:

            if element == unique_element:
                element_already_added = True
                # Ugly
                break

        if not element_already_added:
            unique_elements.append(element)

    return unique_elements


def main():

    data = np.random.randint(0, 100, 1000000)

    start = time.time()
    first_unique_elements = get_unique_elements(data)
    print("get_unique_elements(...) computation time: {:.5f} sec".format(time.time() - start))

    start = time.time()
    second_unique_elements = set(data)
    print("set(...) computation time: {:.5f} sec".format(time.time() - start))

    print("Are both unique elements arrays the same: {}".format(
        sorted(first_unique_elements) == sorted(second_unique_elements)))


if __name__ == "__main__":
    main()

python ./main.py
get_unique_elements(...) computation time: 1.90668 sec
set(...) computation time: 0.07359 sec
Are both unique elements the same: True

Safe interface changes with branching

Problem: Find best pokemon to start you battle!

class Trainer:

    # Loads of other code
    ...

    def get_starting_pokemon(self, opponents_pokemons):

        scores = []

        for current_pokemon in self.my_pokemons:

            current_pokemon_score = 0

            for opponent_pokemon in opponents_pokemons:

                is_current_pokemon_expected_winner = \
                    get_expected_winner(current_pokemon, opponent_pokemon) == current_pokemon
                    
                current_pokemon_score += int(is_current_pokemon_expected_winner)

            scores.append(current_pokemon_score)

        return self.my_pokemons(np.argmax(scores))

def get_expected_winner(first_pokemon, second_pokemon):
    
    # Compare levels, stats, and types to decide
    return ...

Ok, this worked in Pokemon Red/Green/Yellow, but we entered 21st Century, we have Pokemon Sun/Moon (and more) and now battlefield type influences Pokemons' stats as well!
We need to change get_expected_winner(current_pokemon, opponent_pokemon) to:

def get_expected_winner(first_pokemon, second_pokemon, battlefield_type):
    
    # Compare levels, strength, types and battlefield bonuses to decide
    return ...

But if we try to change get_expected_winner(...) now, we will break the client (Trainer class) and it will stay broken until we finish writing get_expected_winner(...). What if this will take us more than a few hours? Whole codebase will be broken during that time and we won't be able run our software.

Instead use a branching method:

1. Create a copy of get_expected_winner(first_pokemon, second_pokemon) called get_expected_winner_new(first_pokemon, second_pokemon)

def get_expected_winner(first_pokemon, second_pokemon):

    # Compare levels, stats, and types to decide
    return ...

def get_expected_winner_new(first_pokemon, second_pokemon):

    # Compare levels, stats, and types to decide
    return ...

2. Add new argument: get_expected_winner_new(first_pokemon, second_pokemon, battlefield_type)

def get_expected_winner(first_pokemon, second_pokemon):

    # Compare levels, stats, and types to decide
    return ...

def get_expected_winner_new(first_pokemon, second_pokemon, battlefield_type):

    # Compare levels, stats, types and battlefield type influence to decide
    return ...

3. Work on get_expected_winner_new(first_pokemon, second_pokemon, battlefield_type) until it's completed
4. Change Trainer class to use get_expected_winner_new(first_pokemon, second_pokemon, battlefield_type) and make sure it works as expected

class Trainer:

    # Loads of other code
    ...

    def get_starting_pokemon(self, opponents_pokemons):

        scores = []

        for current_pokemon in self.my_pokemons:

            current_pokemon_score = 0

            for opponent_pokemon in opponents_pokemons:

                # is_current_pokemon_expected_winner = \
                #    get_expected_winner(current_pokemon, opponent_pokemon) == current_pokemon

                is_current_pokemon_expected_winner = \
                    get_expected_winner_new(current_pokemon, opponent_pokemon, battlefield_type) == current_pokemon

                current_pokemon_score += int(is_current_pokemon_expected_winner)

            scores.append(current_pokemon_score)

        return self.my_pokemons(np.argmax(scores))

5. Remove get_expected_winner(first_pokemon, second_pokemon) and make sure everything works (maybe it had more clients you forgot about?)

# def get_expected_winner(first_pokemon, second_pokemon):

#     # Compare levels, stats, and types to decide
#     return ...

def get_expected_winner_new(first_pokemon, second_pokemon, battlefield_type):

    # Compare levels, stats, types and battlefield type influence to decide
    return ...

6. Rename get_expected_winner_new(first_pokemon, second_pokemon, battlefield_type) to get_expected_winner(first_pokemon, second_pokemon, battlefield_type)

def get_expected_winner(first_pokemon, second_pokemon, battlefield_type):

    # Compare levels, stats, types and battlefield type influence to decide
    return ...

Using branching, you will arrive where you wanted to be, without ever breaking the client and always being able to run your software.

Write code for reader who doesn't know anything about your project