Tutorial

Welcome to ML2P! In this tutorial we’ll take you through:

  • setting up your first project

  • uploading a training dataset to S3

  • training a model

  • deploying a model

  • making predictions

Throughout all of this we’ll be working with the classic Boston house prices dataset that ships with scikit-learn (in versions before 1.2, when it was removed).

Setting up your project

Before running ML2P you’ll need to create the following yourself:

  • a docker image,

  • an S3 bucket,

  • and an AWS role.

ML2P does not manage these for you.
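
For example, the S3 bucket and an ECR repository for the docker image can be created with the AWS CLI (a sketch; the names are placeholders, and you’ll still need to build and push the image and create the role):

$ aws s3 mb s3://your-s3-bucket
$ aws ecr create-repository --repository-name your-docker-image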

Once you have the docker image, bucket and role set up, you are ready to create your ML2P configuration file. Save the following as ml2p.yml:

project: "ml2p-tutorial"
s3folder: "s3://your-s3-bucket/"
models:
  boston: "model.BostonModel"
defaults:
  image: "XXXXX.dkr.ecr.REGION.amazonaws.com/your-docker-image:X.Y.Z"
  role: "arn:aws:iam::XXXXX:role/your-role"
train:
  instance_type: "ml.m5.large"
deploy:
  instance_type: "ml.t2.medium"
  record_invokes: true  # record predictions in the S3 bucket
notebook:
  instance_type: "ml.t2.medium"
  volume_size: 8  # Size of the notebook server disk in GB

Replace the placeholder values on the s3folder, image and role lines with the details of your own S3 bucket, docker image and AWS role.

Initialize the ML2P project

You’re now ready to run your first ML2P command!

First set the AWS profile you’d like to use:

$ export AWS_PROFILE="my-profile"

You’ll need to set the AWS_PROFILE or otherwise provide your AWS credentials whenever you run an ml2p command.
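
If you don’t use named profiles, the standard AWS credential environment variables work too, for example:

$ export AWS_ACCESS_KEY_ID="..."
$ export AWS_SECRET_ACCESS_KEY="..."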

If you haven’t initialized your project before, run:

$ ml2p init

which will create the S3 model and dataset folders for you.

Once you’ve run ml2p init, ML2P will have created the following folder structure in your S3 bucket:

s3://your-s3-bucket/
  models/
    ... ml2p will place the outputs of your training jobs here ...
  datasets/
    ... ml2p will place your datasets here, the name of the
        subfolder is the name of the dataset ...

Creating a training dataset

First create a CSV file containing the Boston house prices that you’ll be using to train your model. You can do this by saving the file below as create_boston_prices_csv.py:

# -*- coding: utf-8 -*-

""" A small script for creating a Boston house price data training set.
"""

import pandas
import sklearn.datasets


def write_boston_csv(csv_name):
    """ Write a Boston house price training dataset. """
    boston = sklearn.datasets.load_boston()
    df = pandas.DataFrame(boston["data"], columns=boston["feature_names"])
    df["target"] = boston["target"]
    df.to_csv(csv_name, index=False)


if __name__ == "__main__":
    write_boston_csv("house-prices.csv")

and running:

$ python create_boston_prices_csv.py

This will write the file house-prices.csv to the current folder.
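
You can sanity-check the file by printing its header row, which should list the thirteen feature columns followed by target:

$ head -n 1 house-prices.csv
CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target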

Now create a training dataset and upload the CSV file to it:

$ ml2p dataset create boston-20200901
$ ml2p dataset up boston-20200901 house-prices.csv

And check that the contents of the training dataset are as expected by listing the files in it:

$ ml2p dataset ls boston-20200901

It is also possible to generate a dataset by implementing a subclass of ml2p.core.ModelDatasetGenerator. The subclass needs to define a .generate(…) method that will generate the dataset and store it.

A simple implementation for the Boston house price dataset generator can be found in model.py:

# -*- coding: utf-8 -*-

""" A model for predicting Boston house prices (part of the ML2P tutorial).
"""

import jsonpickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from ml2p.core import Model, ModelDatasetGenerator, ModelPredictor, ModelTrainer

from create_boston_prices_csv import write_boston_csv


class BostonDatasetGenerator(ModelDatasetGenerator):
    def generate(self):
        """Generate and store the dataset."""
        write_boston_csv("house-prices.csv")
        self.upload_to_s3("house-prices.csv")


class BostonTrainer(ModelTrainer):
    def train(self):
        """Train the model."""
        training_channel = self.env.dataset_folder()
        training_csv = str(training_channel / "house-prices.csv")
        df = pd.read_csv(training_csv)
        y = df["target"]
        X = df.drop(columns="target")
        features = sorted(X.columns)
        X = X[features]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        model = LinearRegression().fit(X_train, y_train)
        with (self.env.model_folder() / "boston-model.json").open("w") as f:
            f.write(jsonpickle.encode({"model": model, "features": features}))


class BostonPredictor(ModelPredictor):
    def setup(self):
        """Load the model."""
        with (self.env.model_folder() / "boston-model.json").open("r") as f:
            data = jsonpickle.decode(f.read())
            self.model = data["model"]
            self.features = data["features"]

    def result(self, data):
        """Perform a prediction on the given data and return the result.

        :param dict data:
            The data to perform the prediction on.

        :returns dict:
            The result of the prediction.
        """
        X = pd.DataFrame([data])
        X = X[self.features]
        price = self.model.predict(X)[0]
        return {"predicted_price": price}


class BostonModel(Model):

    TRAINER = BostonTrainer
    PREDICTOR = BostonPredictor

The dataset can then be generated using the following command:

$ ml2p dataset generate boston-20200901 --model-type boston

Training a model

You’ll need to start by implementing a subclass of ml2p.core.ModelTrainer. Your subclass needs to define a .train(…) method that will load the training set, train the model, and save it.

A simple implementation for the Boston house price model can be found in model.py:

# -*- coding: utf-8 -*-

""" A model for predicting Boston house prices (part of the ML2P tutorial).
"""

import jsonpickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from ml2p.core import Model, ModelDatasetGenerator, ModelPredictor, ModelTrainer

from create_boston_prices_csv import write_boston_csv


class BostonDatasetGenerator(ModelDatasetGenerator):
    def generate(self):
        """Generate and store the dataset."""
        write_boston_csv("house-prices.csv")
        self.upload_to_s3("house-prices.csv")


class BostonTrainer(ModelTrainer):
    def train(self):
        """Train the model."""
        training_channel = self.env.dataset_folder()
        training_csv = str(training_channel / "house-prices.csv")
        df = pd.read_csv(training_csv)
        y = df["target"]
        X = df.drop(columns="target")
        features = sorted(X.columns)
        X = X[features]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        model = LinearRegression().fit(X_train, y_train)
        with (self.env.model_folder() / "boston-model.json").open("w") as f:
            f.write(jsonpickle.encode({"model": model, "features": features}))


class BostonPredictor(ModelPredictor):
    def setup(self):
        """Load the model."""
        with (self.env.model_folder() / "boston-model.json").open("r") as f:
            data = jsonpickle.decode(f.read())
            self.model = data["model"]
            self.features = data["features"]

    def result(self, data):
        """Perform a prediction on the given data and return the result.

        :param dict data:
            The data to perform the prediction on.

        :returns dict:
            The result of the prediction.
        """
        X = pd.DataFrame([data])
        X = X[self.features]
        price = self.model.predict(X)[0]
        return {"predicted_price": price}


class BostonModel(Model):

    TRAINER = BostonTrainer
    PREDICTOR = BostonPredictor

The training data should be read from self.env.dataset_folder(). This is the folder that SageMaker will load your training dataset into.

Once the model is trained, you should write your output files to self.env.model_folder(). SageMaker will read the contents of this folder once training has finished and store them as a .tar.gz file in S3.

Before you train your model in SageMaker, you can try it locally as shown in local.py:

# -*- coding: utf-8 -*-

""" Train the Boston house prices model on your local machine.
"""

import pandas as pd
from ml2p.core import LocalEnv
import model


def train(env):
    """ Train and save the model locally. """
    trainer = model.BostonModel().trainer(env)
    trainer.train()


def predict(env):
    """ Load a model and make predictions locally. """
    predictor = model.BostonModel().predictor(env)
    predictor.setup()
    data = pd.read_csv("house-prices.csv")
    house = dict(data.iloc[0])
    del house["target"]
    print("Making a prediction for:")
    print(house)
    result = predictor.invoke(house)
    print("Prediction:")
    print(result)


if __name__ == "__main__":
    env = LocalEnv(".", "ml2p.yml")
    train(env)
    predict(env)

ML2P provides ml2p.core.LocalEnv which you can use to emulate a real SageMaker environment. SageMaker will read the training data from input/data/training/ so you will need to place a copy of house-prices.csv there for the script to run successfully.
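
For example, from your project folder:

$ mkdir -p input/data/training
$ cp house-prices.csv input/data/training/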

Later in the tutorial you will learn how to download a dataset directly from S3 for use in a local environment.

Once your model works locally, you are ready to train it in SageMaker by creating a training job with:

$ ml2p training-job create boston-train boston-20200901 --model-type boston

The first argument is the name of the training job and the second is the name of the dataset, which should already contain your uploaded training data. The --model-type argument is optional: if you have only a single model defined in ml2p.yml, ML2P will automatically select that one for you.

Wait for your training job to finish. To check up on it you can run:

$ ml2p training-job wait boston-train  # wait for job to finish
$ ml2p training-job describe boston-train  # inspect job

Once your training job is done, there is one more step. The training job records the trained model parameters, but we also need to specify the Docker image that should be used along with those parameters. We do this by creating a SageMaker model from the output of the training job:

$ ml2p model create boston-model boston-train --model-type boston

The first argument is the name of the model to create, the second is the training job the model should be created from.

The Docker image to use is read from the image parameter in ml2p.yml so you don’t have to specify it here.

The model is just an object in SageMaker (it doesn’t run any instances), so it will be created immediately.

Now it’s time to deploy your model by creating an endpoint for it!

Deploying a model

To deploy a model you’ll need to implement a subclass of ml2p.core.ModelPredictor.

You might have seen the implementation for the Boston house price model in model.py while looking at the code for training, but here it is again:

# -*- coding: utf-8 -*-

""" A model for predicting Boston house prices (part of the ML2P tutorial).
"""

import jsonpickle
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

from ml2p.core import Model, ModelDatasetGenerator, ModelPredictor, ModelTrainer

from create_boston_prices_csv import write_boston_csv


class BostonDatasetGenerator(ModelDatasetGenerator):
    def generate(self):
        """Generate and store the dataset."""
        write_boston_csv("house-prices.csv")
        self.upload_to_s3("house-prices.csv")


class BostonTrainer(ModelTrainer):
    def train(self):
        """Train the model."""
        training_channel = self.env.dataset_folder()
        training_csv = str(training_channel / "house-prices.csv")
        df = pd.read_csv(training_csv)
        y = df["target"]
        X = df.drop(columns="target")
        features = sorted(X.columns)
        X = X[features]
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
        model = LinearRegression().fit(X_train, y_train)
        with (self.env.model_folder() / "boston-model.json").open("w") as f:
            f.write(jsonpickle.encode({"model": model, "features": features}))


class BostonPredictor(ModelPredictor):
    def setup(self):
        """Load the model."""
        with (self.env.model_folder() / "boston-model.json").open("r") as f:
            data = jsonpickle.decode(f.read())
            self.model = data["model"]
            self.features = data["features"]

    def result(self, data):
        """Perform a prediction on the given data and return the result.

        :param dict data:
            The data to perform the prediction on.

        :returns dict:
            The result of the prediction.
        """
        X = pd.DataFrame([data])
        X = X[self.features]
        price = self.model.predict(X)[0]
        return {"predicted_price": price}


class BostonModel(Model):

    TRAINER = BostonTrainer
    PREDICTOR = BostonPredictor

The .setup() method is called only once, when a prediction instance starts up. It should read the model from self.env.model_folder(); SageMaker will have placed the trained model files there, in the same location they were stored while running .train(). Other kinds of setup can be done in this method too if you need to.

The .result(data) method is called when a prediction needs to be made. It will be passed the data that was sent to the prediction API endpoint (usually a dictionary with the features as the keys) and should return the prediction.

As you can see in local.py, .result() is usually not called directly. Instead, when a prediction needs to be made, ML2P will call .invoke(), which will then call .result() and add some metadata to the result before returning it.
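
For example, reusing the predictor from local.py (a minimal sketch; the exact metadata fields that .invoke() adds depend on your ML2P version):

raw = predictor.result(house)   # just the model output, e.g. {"predicted_price": ...}
full = predictor.invoke(house)  # the same output plus metadata added by ML2P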

If you ran local.py earlier, you’ve already successfully run a local prediction.

Once you’re ready to deploy your model, you can create an endpoint by running:

$ ml2p endpoint create boston-endpoint --model-name boston-model

The first argument is the name of the endpoint to create, and the --model-name option specifies the model to create the endpoint from.

Note that endpoints can be quite expensive to run, so check the pricing for the instance type you have specified before pressing enter!

Setting up the endpoint takes a while. To check up on it you can run:

$ ml2p endpoint wait boston-endpoint  # wait for endpoint to be ready
$ ml2p endpoint describe boston-endpoint  # inspect endpoint

Once the endpoint is ready, your model is deployed!

You can make a test prediction using:

$ ml2p endpoint invoke boston-endpoint '{"CRIM": 0.00632, "ZN": 18.0, "INDUS": 2.31, "CHAS": 0.0, "NOX": 0.5379999999999999, "RM": 6.575, "AGE": 65.2, "DIS": 4.09, "RAD": 1.0, "TAX": 296.0, "PTRATIO": 15.3, "B": 396.9, "LSTAT": 4.98}'
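
If you’d like to call the endpoint from Python rather than the CLI, here is a minimal sketch using boto3’s SageMaker runtime client (it sends the same payload to the same endpoint, bypassing the ml2p CLI):

import json

import boto3

runtime = boto3.client("sagemaker-runtime")
house = {
    "CRIM": 0.00632, "ZN": 18.0, "INDUS": 2.31, "CHAS": 0.0,
    "NOX": 0.538, "RM": 6.575, "AGE": 65.2, "DIS": 4.09,
    "RAD": 1.0, "TAX": 296.0, "PTRATIO": 15.3, "B": 396.9, "LSTAT": 4.98,
}
response = runtime.invoke_endpoint(
    EndpointName="boston-endpoint",
    ContentType="application/json",
    Body=json.dumps(house),
)
print(json.loads(response["Body"].read()))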

Congratulations! You have trained and deployed your first model using ML2P!

Security considerations

ML2P runs inside SageMaker, so authentication and authorization of prediction requests is managed using AWS IAM profiles and roles, but there are still some important things to consider:

Roles

The role in ml2p.yml defines the permissions your training jobs and endpoints will assume while they run. Best practice is to have a role specific to each ML2P project and for that role to have only the permissions it requires.
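
For example, a role that SageMaker can assume may be created with the AWS CLI (a sketch; attach only the minimal permission policies your project needs afterwards):

$ aws iam create-role --role-name your-role \
    --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "sagemaker.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'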

VPCs

By default, SageMaker instances run outside of any AWS VPC. This means that the instances access other AWS services (e.g. downloading training data or a stored model from S3) via the public internet address of the service (the connection is encrypted and authenticated as usual) and that they have no special access to any other services you might be running inside a VPC.

You can attach your SageMaker instances to a VPC by specifying a vpc_config:

vpc_config:
  security_groups:
    - "sg-XXXX"
  subnets:
    - "net-YYYY"

This will do two things.

Firstly, it will allow your SageMaker instances to access instances within the VPC (according to the subnet and security group rules).

Secondly, it will prevent your SageMaker instances from accessing the public internet (unless allowed to by the security group or subnet rules). This second point means you may have to configure a VPC Endpoint to allow your SageMaker instances to access other AWS services such as S3.
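
For example, an S3 gateway endpoint can be created with the AWS CLI (a sketch; the VPC, region and route table IDs are placeholders):

$ aws ec2 create-vpc-endpoint --vpc-id vpc-XXXX \
    --service-name com.amazonaws.REGION.s3 \
    --route-table-ids rtb-XXXX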

You can read more on how to Give Endpoints Access to Resources in Your VPC and how to Give Training Jobs Access to Resources in Your VPC in the AWS SageMaker documentation.

Prediction web requests

While your SageMaker instance is likely protected from unauthorised access, care should be taken when handling untrusted data, as with any web API. This includes any fields passed to the model that can be manipulated by an untrusted party (for example, emails or other text from customers).
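
One defensive option is to validate incoming fields before using them. A hypothetical sketch, rewriting the BostonPredictor.result() shown earlier:

    def result(self, data):
        """Validate the input, then perform the prediction."""
        # hypothetical checks: reject missing, unexpected or non-numeric features
        missing = set(self.features) - set(data)
        if missing:
            raise ValueError("Missing features: %s" % sorted(missing))
        unexpected = set(data) - set(self.features)
        if unexpected:
            raise ValueError("Unexpected features: %s" % sorted(unexpected))
        if not all(isinstance(data[f], (int, float)) for f in self.features):
            raise ValueError("All features must be numeric")
        X = pd.DataFrame([data])[self.features]
        return {"predicted_price": self.model.predict(X)[0]}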

Working with models locally

At times it may be convenient to work with ML2P models on a local machine, rather than within SageMaker. ML2P supports both training models locally and loading models trained in SageMaker for local analysis.

In either case, first create a local environment:

# set up a connection to AWS, specifying an appropriate AWS profile name:
import boto3
session = boto3.session.Session(profile_name="aws-profile")

# create a local environment
from ml2p.core import LocalEnv
env = LocalEnv(".", "./ml2p.yml", session)

# import your ml2p model class:
from model import BostonModel

The first argument to LocalEnv is the local folder to store the environment in, and the second is the path to the ml2p.yml config file for the project.

The third argument, session, is an optional boto3 session and is only needed if you wish to download datasets or models from S3 to your local environment.

To download a dataset from S3 into the local environment, use:

env.download_dataset("dataset-name")

If you prefer not to download a dataset, you can also copy a local file into:

input/data/training/

For example, for this tutorial it may be useful to copy the house-prices.csv training file into this folder using:

$ mkdir -p input/data/training/
$ cp house-prices.csv input/data/training/

Once you have a dataset you can train a model locally using:

env.clean_model_folder()
trainer = BostonModel().trainer(env)
trainer.train()

The first line, env.clean_model_folder(), just deletes any old files created by previous local training runs.

You can list the model files created during training using:

$ ls model/

If you have already trained a model in SageMaker with ml2p training-job create and would like to examine it locally, you can download it into the model folder by running:

env.download_model("training-job-name")

Once you have a model available locally, either by training it locally or by downloading it, you can make predictions with:

predictor = BostonModel().predictor(env)
predictor.setup()
predictor.invoke(data)

Happy local analyzing and debugging!