The problem
You have trained a model, but other people cannot use it conveniently: they have to clone your code, load the saved weights and re-run it every time they want something out of the model. You search for how to set this code and the saved weights up as a service, so that other people can just send a request to an endpoint and get the model’s output. The most recommended answer you find is “Tensorflow Serving”. The only blocker is a lack of clear and concise documentation for saving the model as per Tensorflow Serving’s requirements and setting up the server. To solve this, I am writing about what I learned while figuring out how to set this service up, so that other first-timers do not have to go through the same confusion and face the same errors.
I am breaking down this article into three parts:
- Potential issues while saving the trained model
- Server side setup for both GPU and CPU
- Client-side details
The starting point:
I assume you have your code up till the point where you can load saved weights and pass your input to a function to get the outputs from whichever layers of the model you need. In some cases, you may be training a model for a different task, and using the outputs of the hidden layers of the trained model for a different purpose. So, your starting point after training the model would be this function.
from mymodel import get_required_output

my_input = process_data(my_data)
my_output = get_required_output(my_input)
Saving the model:
Tensorflow has two methods of saving a trained model:
- Saving variables in checkpoint files using tf.train.Saver. This will give you a set of checkpoint files with a structure similar to this:
checkpoint_directory
  |- checkpoint
  |- model.ckpt-935588.data-00000-of-00001
  |- model.ckpt-935588.index
  |- model.ckpt-935588.meta
You can load these files back into your code and rerun it to get the output you want, but Tensorflow Serving works only with a model saved using tf.saved_model. So if you have these files after your training has completed, write a piece of code that loads them into your model and gives you your desired output. Once you have this piece of code, you can use it to save the model in a format that works with Tensorflow Serving.
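When restoring, tf.train.Saver expects the checkpoint prefix (for example model.ckpt-935588), not one of the individual files. TensorFlow's own tf.train.latest_checkpoint finds that prefix by reading the checkpoint file; the sketch below is a standard-library stand-in that illustrates the naming scheme by scanning for .index files. The helper name latest_checkpoint_prefix and the demo file names are assumptions made for illustration.

```python
import os
import re
import tempfile

# Matches index files such as "model.ckpt-935588.index" and captures the step.
_INDEX_PATTERN = re.compile(r"^(?P<prefix>.+-(?P<step>\d+))\.index$")


def latest_checkpoint_prefix(checkpoint_dir):
    """Return the path prefix of the highest-step checkpoint, or None."""
    best_step, best_prefix = -1, None
    for name in os.listdir(checkpoint_dir):
        match = _INDEX_PATTERN.match(name)
        if match and int(match.group("step")) > best_step:
            best_step = int(match.group("step"))
            best_prefix = os.path.join(checkpoint_dir, match.group("prefix"))
    return best_prefix


# Demo on a throwaway directory that mimics the listing above
# (plus one older, hypothetical checkpoint at step 120).
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ["checkpoint", "model.ckpt-935588.index",
                 "model.ckpt-935588.meta", "model.ckpt-120.index"]:
        open(os.path.join(demo_dir, name), "w").close()
    prefix = latest_checkpoint_prefix(demo_dir)

print(os.path.basename(prefix))  # model.ckpt-935588
```

The returned prefix is the value you would pass as the save path when restoring the weights into your session.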
- Saving the model using tf.saved_model requires you to know the exact input and output tensors that you plan to give to and expect in return from the model. You may be using pieces of open source code that you had been treating as black boxes and you may want this service to return the output of some of the hidden layers instead of the final output of the model during its training. It can be confusing to find the tensors you need from a large graph in the open source code that you did not write yourself. This is why it would be convenient to be at the point where you have your function to get the required output ready. Inside this function you should find some lines that look like this:
input_placeholder = tf.placeholder(
  'int32',
  shape=(None, None, input_dimension1)
)
model = MyBigModel(weight_file)
ops = model(input_placeholder)

# sess is the tf.Session that holds the loaded weights
required_output = sess.run(
  ops['required_tensor'],
  feed_dict={input_placeholder: required_input}
)
If this piece of code is giving you the desired output, the input and output tensor info needed for saving this model can be built using tf.saved_model.utils.build_tensor_info as follows:
# get input and output tensor info
input_tensor_info = tf.saved_model.utils.build_tensor_info(input_placeholder)
output_tensor_info = tf.saved_model.utils.build_tensor_info(ops['required_tensor'])
You can then create a signature_def_map from input_tensor_info and output_tensor_info and save the model with the serve tag as follows. Note the key model_input used while creating signature_map below: this same key is used when sending POST requests to the service.
# save the model using tf.saved_model
signature_map = (
  tf.saved_model.signature_def_utils.build_signature_def(
    inputs={'model_input': input_tensor_info},
    outputs={'model_output': output_tensor_info},
    method_name=tf.saved_model.signature_constants.PREDICT_METHOD_NAME
))

export_path = 'path/to/export/dir'
builder = tf.saved_model.builder.SavedModelBuilder(export_path)
builder.add_meta_graph_and_variables(
  sess,
  [tf.saved_model.tag_constants.SERVING],
  signature_def_map={
    'serving_default': signature_map,
  },
  clear_devices=True
)
builder.save()
The clear_devices flag should be set to True if your tensors and operations had their device field set up. If you leave this flag at its default False value, you may get errors when you try to run this service on a different machine.
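The saved model will be exported with the following layout. Tensorflow Serving looks for numbered version subdirectories under the model path you mount, so the export directory itself should be named with a version number (for example, 1).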
export_dir
  |- saved_model.pb
  |- variables
       |- variables.data-00000-of-00001
       |- variables.index
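Before pointing the server at an export, it can help to confirm that the directory has the expected entries. The following is a minimal standard-library sketch; the helper name missing_export_files is hypothetical, and it checks only for saved_model.pb and variables/variables.index, not for the sharded data files.

```python
import os
import tempfile


def missing_export_files(export_dir):
    """Return the expected SavedModel entries that are absent from export_dir."""
    expected = [
        "saved_model.pb",
        os.path.join("variables", "variables.index"),
    ]
    return [rel for rel in expected
            if not os.path.exists(os.path.join(export_dir, rel))]


# Demo on a throwaway directory: first incomplete, then complete.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "saved_model.pb"), "w").close()
    missing_before = missing_export_files(demo_dir)

    os.makedirs(os.path.join(demo_dir, "variables"))
    open(os.path.join(demo_dir, "variables", "variables.index"), "w").close()
    missing_after = missing_export_files(demo_dir)

print(missing_before)  # the variables index is still missing
print(missing_after)   # nothing is missing
```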
- If you are using tf.estimator.Estimator or tf.contrib.tpu.TPUEstimator to train your model, you can export the model by defining an input function like this:
def serving_input_fn():
  input_placeholder = tf.placeholder(
    'int32',
    shape=(None, None, input_dimension1)
  )
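  # build_raw_serving_input_receiver_fn returns a function; calling it
  # yields the ServingInputReceiver that the estimator expects
  input_fn = tf.estimator.export.build_raw_serving_input_receiver_fn({
    'model_input': input_placeholder
  })
  return input_fn()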
and using it in estimator.export_saved_model
estimator.export_saved_model(export_dir, serving_input_fn, strip_default_attrs=True)
Server side setup:
If your server machine has GPUs, you should use the following steps:
- Install nvidia-docker as per these instructions
- Download the docker file for GPU setup from the tensorflow repository
- Customize the docker file as per your needs. For example, if you want the model to be served at port 8010 instead of the default 8501, change the rest_api_port value in the command tensorflow_model_server --port=8500 --rest_api_port=8501 …
- Build the docker image using
$ docker build -t abc/tensorflow-serving-gpu -f Dockerfile.gpu .
adding additional arguments as per docker build docs
- Start the service using
$ docker run --runtime=nvidia -p 8010:8010 --mount type=bind,source=/path/to/saved/model/versions/,target=/models/model -it --rm abc/tensorflow-serving-gpu:latest
For server machines with no GPU, you can use docker instead of nvidia-docker and the docker file for CPU from the tensorflow repository. The docker run command will not need the --runtime=nvidia argument either.
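The GPU and CPU start commands differ only in the runtime flag and the image. As a rough sketch, the helper below assembles both variants from the same pieces; the function name build_docker_run and the CPU image name abc/tensorflow-serving-cpu are assumptions for illustration, and the port mapping presumes the docker file was customized to serve REST on 8010 as described above.

```python
import shlex


def build_docker_run(image, model_dir, rest_port=8501, use_gpu=False):
    """Assemble the `docker run` argument list for a serving container."""
    command = ["docker", "run"]
    if use_gpu:
        # Only the GPU setup needs the nvidia runtime.
        command.append("--runtime=nvidia")
    command += [
        "-p", f"{rest_port}:{rest_port}",
        "--mount", f"type=bind,source={model_dir},target=/models/model",
        "-it", "--rm", image,
    ]
    return command


model_dir = "/path/to/saved/model/versions/"
gpu_command = build_docker_run(
    "abc/tensorflow-serving-gpu:latest", model_dir, rest_port=8010, use_gpu=True)
cpu_command = build_docker_run(
    "abc/tensorflow-serving-cpu", model_dir, rest_port=8010)  # hypothetical image name

# Print shell-ready versions of both commands.
print(" ".join(shlex.quote(arg) for arg in gpu_command))
print(" ".join(shlex.quote(arg) for arg in cpu_command))
```

The resulting lists can be passed to subprocess.run if you prefer to start the container from a script rather than the shell.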
Client-side processing:
This part is necessary only when you need to do some preprocessing to convert your input data into input arrays for your model. A good example is the conversion of text into an array of token IDs or character IDs for a model’s initial embedding layer. This part will have to be done at the client side before sending requests to your server. 
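To make this concrete, here is a minimal sketch of character-ID preprocessing for one sentence. The ID scheme (Unicode code point plus one, with 0 reserved for padding), the whitespace tokenization and the function names are all assumptions for illustration; your model will need the exact vocabulary and shapes it was trained with.

```python
def token_to_char_ids(token, chars_per_token, pad_id=0):
    """Map each character of a token to an ID, then pad or truncate."""
    # Hypothetical scheme: code point + 1, so that 0 stays free for padding.
    ids = [ord(ch) + 1 for ch in token[:chars_per_token]]
    return ids + [pad_id] * (chars_per_token - len(ids))


def sentence_to_char_ids(sentence, chars_per_token):
    """Return one row of character IDs per whitespace-separated token."""
    return [token_to_char_ids(tok, chars_per_token) for tok in sentence.split()]


input_data = sentence_to_char_ids("hi all", chars_per_token=4)
print(input_data)  # [[105, 106, 0, 0], [98, 109, 109, 0]]
```

The result has shape (tokens, chars_per_token) for a single example, which lines up with the last two dimensions of the placeholder used when saving the model. It is already a nested list, so it can be placed in the request body without calling tolist().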
After you have preprocessed your data into the expected input arrays, you can send POST requests in the following format.
import json
import requests

# input_data is the preprocessed NumPy array for a single example
endpoint = 'http://server_ip:port/v1/models/model:predict'
requests.post(
  url=endpoint,
  data=json.dumps(
    {'instances':
      [{'model_input': input_data.tolist()}]
    }),
  headers={'Content-Type': 'application/json'}
)
The key instances depends on the method_name you used in your signature_map while saving the model. I used tf.saved_model.signature_constants.PREDICT_METHOD_NAME, which requires the key instances. Each instance inside should have the key model_input that was used when you created your signature_map. This code snippet is consistent with the code snippet for saving the model above, so you can compare and check what these keys should be in your case.
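The request body and the reply can be handled with two small helpers, sketched here without contacting a server. The function names are hypothetical. The predictions and error keys follow the response format that Tensorflow Serving's REST predict API documents for requests sent in the instances format; the response text in the demo is hand-written sample data, not real model output.

```python
import json


def build_predict_payload(instances, input_key="model_input"):
    """Serialize per-example inputs in the row ('instances') format."""
    return json.dumps({"instances": [{input_key: inst} for inst in instances]})


def parse_predictions(response_text):
    """Extract the list of per-example outputs from a predict response body."""
    body = json.loads(response_text)
    if "error" in body:
        # The server reports failures as a JSON object with an error message.
        raise RuntimeError(body["error"])
    return body["predictions"]


# One example whose input is a small nested list of IDs.
payload = build_predict_payload([[[1, 2], [3, 4]]])
print(payload)

# Hand-written sample response, standing in for what the server would return.
sample_response = '{"predictions": [[0.1, 0.9]]}'
print(parse_predictions(sample_response))  # [[0.1, 0.9]]
```

In practice, payload would be passed as the data argument of requests.post, and the text of the returned response would go to parse_predictions.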
That sums up a minimal process for setting up Tensorflow Serving for any general model! I hope this helps someone understand and set up Tensorflow Serving faster. Let me know in the comments if this solves some of your problems or if you have better ways of executing some of these steps.
About the Author: Apporv Nandan is a Machine Learning Scientist at Observe.AI

















