Software development best practices

AI Okinawa

Jakub “Kuba” Kolodziejczyk



Kuba Kolodziejczyk
From: Poland
Universities: University of London, Osaka University

Past
Nanyang Technological University
OIST
Lexues

Present
AI Okinawa - Representative
LiLz Inc. - CTO
University of the Ryukyus - Part-time lecturer



Course content
  • Testing
  • Version control
  • Code reviews
  • Configuration management
  • Environment sandboxes
  • Continuous integration
  • Continuous deployment
  • Infrastructure as code



Testing
  • It's hard to write complex code correctly
  • Corner cases
  • Quickly notice bugs due to changes in underlying code
  • Refactor with confidence
  • Executable documentation
  • Leads to better design

Unit testing
  • get_factorial(x)
  • get_faces(image)


Acceptance testing
  • test_factorial_endpoint_with_simple_input()
  • Tests application starts up correctly
  • Tests environment - server has the correct libraries installed, etc
  • Tests service connections - we can connect to database, etc
  • Tests all software components (code, databases, external services, etc) work correctly together
  • Used to test "happy path" - typical interactions user has with software
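
Unlike a unit test, an acceptance test starts the application and talks to it the way a user would. The sketch below assumes a hypothetical HTTP app built on the standard library that serves GET /factorial?x=5; the handler, URL layout and JSON response shape are all assumptions made for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse


def get_factorial(x):
    result = 1
    for value in range(2, x + 1):
        result *= value
    return result


class FactorialHandler(BaseHTTPRequestHandler):
    """Hypothetical app: GET /factorial?x=5 responds with {"result": 120}."""

    def do_GET(self):
        url = urlparse(self.path)
        if url.path != "/factorial":
            self.send_error(404)
            return
        x = int(parse_qs(url.query)["x"][0])
        body = json.dumps({"result": get_factorial(x)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def test_factorial_endpoint_with_simple_input():
    # Start the whole application on a free port chosen by the operating system
    server = HTTPServer(("127.0.0.1", 0), FactorialHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # Happy path: a typical request a user would make
        connection = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        connection.request("GET", "/factorial?x=5")
        response = connection.getresponse()
        assert response.status == 200
        assert json.loads(response.read()) == {"result": 120}
        connection.close()
    finally:
        server.shutdown()
        server.server_close()
```

If this test passes, the app started, the port was reachable, and the request handling and factorial code worked together.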



Version control


Should I use git?


  • What is version control?
  • Branches
  • Merges
  • Resolving conflicts
  • Why use version control in a personal project?
    • easily revert to last known state
    • don't worry about breaking something
    • check what change introduced a problem (diff)
  • Why use version control in a team?
    • all of the above
    • don't step on each other's toes (branches)
    • combine code in a controlled manner (merges)
    • pull requests
    • code reviews
    • continuous integration



Code reviews
Code review process

  • Developer works on code in her own branch
  • Once the developer finishes a logical piece of work, she sends a pull request
  • Code is first reviewed by automatic tests - see continuous integration
  • A senior engineer reviews code - reads it, highlights issues if any
  • If code passes review, it's merged to master. Otherwise, the developer fixes the issues and sends a new pull request


Benefits of code reviews

  • Catches bugs before they make it to master branch
  • Increases codebase understanding among team members
  • Training for junior (and senior!) engineers - reviewer looks at your code and gives you feedback
  • More cohesive architecture - reviewer can enforce a consistent style across the whole codebase



Final boss on the quest to merge your code to master - the fearsome reviewer!



Configuration management

Typical environments
  • Development - one per engineer
  • Testing - checks code works on a machine with known configuration, not just your machine
  • Staging - as close to production as possible, but not used by the client (if you have to mess up, do it there)
  • Production - client facing


Parts that differ between environments
  • Database and other service URLs
  • Paths to resources
  • Ports application uses
  • Logging options


Options for passing configuration to application
  • Command line arguments
  • Environment variables
  • Configuration file


Command line arguments
Passing configuration to application:
./myapp --database-url=... --port=123 --loglevel=info

Using configuration inside application:

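# ./myapp.py
import argparse


def get_arguments_parser():
    """Build the parser for the options the application accepts."""

    parser = argparse.ArgumentParser()
    parser.add_argument("--database-url")
    parser.add_argument("--port")
    parser.add_argument("--loglevel")

    return parser


def foo(database_url):

    print("Doing something with {}".format(database_url))


def main():

    parser = get_arguments_parser()
    arguments = parser.parse_args()

    # parse_args() returns a Namespace: "--database-url" becomes the attribute database_url
    foo(arguments.database_url)


if __name__ == "__main__":
    main()

Disadvantages: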
• a lot of typing, thus not very convenient
• easy to make a mistake
• doesn't scale


Environment variables:

Passing configuration to application:
DATABASE_URL=... PORT=123 LOGLEVEL=info ./myapp
Suffers from exactly the same problems as command line arguments

OR

# .bashrc
export DATABASE_URL=...
export PORT="123"
export LOGLEVEL="info"
Then from command line:
source ~/.bashrc
./myapp


Using configuration inside application:

# ./myapp.py
import os


def foo():

    # Raises KeyError if DATABASE_URL is not set in the environment
    database_url = os.environ["DATABASE_URL"]
    print("Doing something with {}".format(database_url))


def main():

    foo()


if __name__ == "__main__":
    main()

Disadvantages:
• Configuration isn't stored in project, thus can't be version controlled
• Environment variables are essentially globals, so it's hard to see all inputs to a given piece of code


Configuration file:

Define configuration file:

# ./configurations/development_config.yaml
database_url: ...
port: 123
loglevel: info
Passing configuration to application:
./myapp --config-path=./configurations/development_config.yaml

Using configuration inside application:

# ./myapp.py
import argparse

import yaml


def get_arguments_parser():

    parser = argparse.ArgumentParser()
    parser.add_argument("--config-path")

    return parser


def foo(database_url):

    print("Doing something with {}".format(database_url))


def main():

    parser = get_arguments_parser()
    arguments = parser.parse_args()

    # "--config-path" is exposed as the attribute config_path
    with open(arguments.config_path) as file:

        # safe_load parses the YAML file into plain Python dictionaries and lists
        config = yaml.safe_load(file)

    foo(config["database_url"])


if __name__ == "__main__":
    main()

Advantages:
• Version controlled
• Easy to add new options
• Visible in code




Sandboxing
Environment sandboxing



  • Environment resources
    • Executables
    • Compilers
    • Libraries
  • Problems with not managing resources per project
    • Can't easily reproduce environment on a different machine
    • Environment update on one project changes environment definition on another project
    • Environment update on one project can cause bugs on another project
  • Solution - environment management!
    • Project keeps an environment definition file that defines all resources and their versions
    • All resources are installed locally per project, rather than shared across the system
    • Environment definition file is version controlled, just like any other code
    • Using environment definition, exact environment can be reproduced on another machine
  • Sample environment definition file from Anaconda
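
As a rough illustration of what a version-pinned definition makes possible, the sketch below parses a definition and reports every library whose installed version differs from it. It assumes a simplified pip-style "name==version" format rather than Anaconda's own file format, and the function names are hypothetical.

```python
from importlib import metadata


def parse_definition(text):
    """Parse lines such as 'numpy==1.26.4' into a {name: version} mapping."""
    pinned = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        name, version = line.split("==")
        pinned[name.strip().lower()] = version.strip()
    return pinned


def find_mismatches(pinned, installed_version=metadata.version):
    """Return {name: (wanted, found)} for every library that differs from the definition.

    found is None when the library is not installed at all.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            found = installed_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches
```

An empty result from find_mismatches means the libraries on this machine match the definition; in practice the environment manager does this work when it creates the environment from the file.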


Services sandboxing


  • Services used on a typical web project
    • Database
    • File storage
    • Web server configuration
  • Risks in modifying shared services
    • Break production
    • Break development for other engineers
    • Transient bugs
  • Solution - services sandboxes!
    • Each developer uses a local instance of service per project
    • Most popular tool for services sandboxing: Docker



Continuous integration
What is continuous integration?

  • A systematic process for frequently merging new code into main codebase
  • Developers push their code to master branch often
  • Each push is run through automated tests and results are visible to everyone in the company
  • Bad code is automatically rejected by integration server


What problems does it try to solve?

  • Large and difficult merges
  • Developers working on old code
  • Bugs surfacing weeks after they were introduced


Typical stages in continuous integration pipeline

  • Commit stage
    • Build executables and other deliverables (e.g. documentation)
    • Run static code analysis
    • Run unit tests
  • Acceptance/Integration stage
    • Run tests interacting with the app (implies app has to be started for each test)
    • Run database migration tests
  • Optional - Performance profiling stage
    • Throughput stress tests
    • Load stress tests
  • Manual - code review stage
    • Lead engineer or other senior engineer reviews the code


When and where should the continuous integration pipeline be triggered?

  • Automatically on integration server on every commit to master
  • Automatically on integration server on every pull request
  • Automatically on integration server on every commit to feature branch
  • Manually on integration server for any of above at engineer's request
  • Manually on local machine at engineer's request


Popular continuous integration tools

  • Jenkins
  • GoCD
  • Travis CI
  • Quite a few others...




Continuous deployment

  • Release often, fail fast
  • Continuous deployment is continuous integration taken to its logical conclusion


Common reasons for deployment failures

  • It's a complex manual process and someone gets a step wrong, e.g.
    • forgets to rewrite symbolic link
    • forgets to update environment
    • uploads wrong external resources configuration (e.g. development config instead of production config)
    • doesn't link new version to correct load balancer
    • etc
  • New code doesn't handle existing data correctly - a regression bug
  • Bad database refactoring


Benefits of continuous deployment

  • Smaller window of change == smaller chance of failure
  • When you do fail, you have a smaller amount of code to look at to find the problem
  • Quicker time to market
  • Forces deployment process automation - which decreases manual process risk
  • If you can deploy fast, then you can recover fast too


Deployment best practices

  • Staging environment and production environment
    • staging environment should be as similar to production as possible - real database, real load balancer, etc
    • but - never use production data in staging!
  • Automate as much as possible
  • Sanity tests for production environment




Infrastructure as code

Common infrastructure tasks

  • Provisioning servers
    • start virtual machine with correct hardware and software combination
    • update third party code
    • update app code
    • update operating system configuration
    • update test updates
  • Setting up virtual network
  • Setting up network security groups
  • Setting up application gateways and load balancers
  • Hooking up application gateways <-> load balancers <-> servers chain
  • Provisioning and configuring other resources
    • hard drives
    • cloud storage
    • databases
    • static IPs
    • etc


Problems with manually managing infrastructure

  • Low bus factor
  • Easy for mistakes to creep in
  • High risk of subtle differences between resources


Benefits of automating infrastructure tasks

  • Repeatable and consistent
  • Scalable
  • Faster (though not fast - provisioning servers often takes several minutes)
  • Code constitutes executable documentation - never out of date
  • Easier to inherit, refactor and learn from
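
The repeatability comes from describing the desired infrastructure as data and letting a program work out what has to change. The sketch below is a toy illustration of that declarative idea and makes no claim about how any real tool is implemented; the plan function and the resource names are hypothetical.

```python
def plan(desired, actual):
    """Compare desired and actual resources and return the actions needed to reconcile them."""
    actions = []
    for name in sorted(desired):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state lives in version control; actual state is what currently exists
desired = {
    "web-server": {"size": "small", "count": 2},
    "database": {"engine": "postgres"},
}
actual = {
    "web-server": {"size": "small", "count": 1},
    "old-cache": {"engine": "redis"},
}

for action, name in plan(desired, actual):
    print(action, name)
```

Running the plan a second time after the changes are applied yields no actions, which is what makes the process safe to repeat.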


Popular infrastructure as code tools

  • Terraform
  • AWS CloudFormation/Azure Resource Manager Templates
  • Ansible
  • SaltStack
  • Puppet
  • Chef