ecosystem/marathon at master · tensorflow/ecosystem · GitHub

ecosystem – Integration of TensorFlow with other open-source frameworks

@KarlKFI: Run @tensorflow on @ApacheMesos with @dcos!

Before you start, set up a Mesos cluster with Marathon installed and with the Docker Containerizer and Mesos-DNS enabled. It is also preferable to set up shared storage, such as HDFS, in the cluster. All of these can be installed and configured with the help of DC/OS. Make a note of the master target, DNS domain and HDFS namenode, which are needed to bring up the TensorFlow cluster.

This section covers instructions on how to write your training program and build your docker image.

Write your own training program. This program must accept worker_hosts, ps_hosts, job_name and task_index as command line flags, which are then parsed to build the cluster specification. After that, the task either joins with the server or starts building graphs. Please refer to the main page for code snippets and a description of between-graph replication. An example can be found in docker/mnist.py.
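The flag handling described above can be sketched in plain Python. This is a minimal illustration and not the code in docker/mnist.py: the function names are hypothetical, the host names in the demo are made up, and TensorFlow is deliberately left out. In a real training program the resulting mapping would typically be passed to tf.train.ClusterSpec before creating the server.

```python
import argparse


def parse_cluster_flags(argv):
    """Parse the four flags the training program must accept."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--worker_hosts", required=True,
                        help="comma-separated host:port list of workers")
    parser.add_argument("--ps_hosts", required=True,
                        help="comma-separated host:port list of parameter servers")
    parser.add_argument("--job_name", required=True, choices=["ps", "worker"])
    parser.add_argument("--task_index", type=int, required=True)
    return parser.parse_args(argv)


def build_cluster(flags):
    """Return a job-name -> host-list mapping describing the cluster."""
    return {
        "ps": flags.ps_hosts.split(","),
        "worker": flags.worker_hosts.split(","),
    }


def my_address(flags, cluster):
    """Look up this task's own host:port from its job name and index."""
    return cluster[flags.job_name][flags.task_index]


if __name__ == "__main__":
    # Made-up hosts, used only to exercise the parsing logic.
    demo = parse_cluster_flags([
        "--worker_hosts=w0.example:2222,w1.example:2222",
        "--ps_hosts=ps0.example:2222",
        "--job_name=worker",
        "--task_index=1",
    ])
    cluster = build_cluster(demo)
    print(cluster)
    print(my_address(demo, cluster))
```

A task started with job_name set to ps would join with the server at this point, while a worker would go on to build its graph.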

If the training program needs a large amount of training input, we recommend copying your data to shared storage first and then pointing each worker to the data. You may want to add a flag called data_dir. Please refer to the adding flags section for adding this flag to the Marathon config.
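One possible way for workers to share input under a common data_dir is sketched below. The round-robin split, the function names, and the assumption that the shared storage is reachable as an ordinary filesystem path are all illustrative choices, not something the example program is known to do.

```python
import argparse
import os


def parse_data_flags(argv):
    """Parse the data_dir flag alongside task_index; other flags are ignored."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data_dir", default="",
                        help="directory on shared storage holding the input files")
    parser.add_argument("--task_index", type=int, default=0)
    flags, _unknown = parser.parse_known_args(argv)
    return flags


def files_for_worker(data_dir, task_index, num_workers):
    """Give each worker every num_workers-th file, starting at its own index."""
    names = sorted(os.listdir(data_dir))
    return [os.path.join(data_dir, n) for n in names[task_index::num_workers]]
```

With two workers, the worker with task_index 0 would read the first, third, fifth file and so on, and the worker with task_index 1 would read the rest.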

Write your own Dockerfile, which simply copies your training program into the image and optionally specifies an entrypoint. An example is located in docker/Dockerfile, or docker/Dockerfile.hdfs if you need HDFS support. TensorBoard can also use the same image, but with a different entrypoint.

Build your Docker image and push it to a Docker repository:

Please refer to the Docker images page for best practices on building Docker images.

The Marathon config is generated from a Jinja template. You customize your own cluster configuration in the header of the template file.

Copy over the template file:

Edit the copied template file. You need to specify the name, image_name and train_dir, and can optionally change the number of worker and ps replicas. The train_dir must point to a directory on shared storage if you would like to use TensorBoard or sharded checkpoints.
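To make the role of these header values more concrete, the sketch below expands them into one entry per task in plain Python. It is a stand-in for the idea of template rendering, not the repository's Jinja template: the output structure is simplified and is not the Marathon app schema, and the host naming scheme, port, dns_domain key and all sample values are assumptions. Only the names name, image_name and train_dir and the worker and ps replica counts come from the text.

```python
import json


def hosts(job, replicas, cfg):
    """Build host:port strings for one job (naming scheme is assumed)."""
    return ["%s-%d-%s.%s:%d" % (job, i, cfg["name"], cfg["dns_domain"], cfg["port"])
            for i in range(replicas)]


def task_args(job, index, cfg):
    """Command line flags handed to a single task."""
    return [
        "--worker_hosts=" + ",".join(hosts("worker", cfg["worker_replicas"], cfg)),
        "--ps_hosts=" + ",".join(hosts("ps", cfg["ps_replicas"], cfg)),
        "--job_name=" + job,
        "--task_index=%d" % index,
        "--train_dir=" + cfg["train_dir"],
    ]


def render_tasks(cfg):
    """Expand the header values into one simplified entry per task."""
    tasks = []
    for job, replicas in (("ps", cfg["ps_replicas"]),
                          ("worker", cfg["worker_replicas"])):
        for i in range(replicas):
            tasks.append({
                "id": "%s-%d" % (job, i),
                "image": cfg["image_name"],
                "args": task_args(job, i, cfg),
            })
    return {"id": cfg["name"], "tasks": tasks}


if __name__ == "__main__":
    # Hypothetical header values for illustration only.
    header = {
        "name": "mycluster",
        "image_name": "example/image",
        "train_dir": "/shared/train_dir",
        "worker_replicas": 2,
        "ps_replicas": 1,
        "dns_domain": "example.domain",
        "port": 2222,
    }
    print(json.dumps(render_tasks(header), indent=2))
```

Every task receives the same worker_hosts and ps_hosts lists and differs only in job_name and task_index, which is why changing a replica count in the header affects the flags of all tasks.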

Generate the Marathon JSON config:

by default:

The example prints the loss for each step and the final loss when training is done.

with your browser or find out its IP address from the DC/OS web console.

Suppose you would like to add a flag, such as the data_dir flag mentioned earlier, to the rendered config. Before rendering the template, make the following changes:

Add a variable in the header of the template file.

Add the flag to the args section of the template:
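As a rough Python analogue of these two steps (not the template's actual Jinja syntax), the sketch below treats the header as a dictionary and appends the new flag to a task's argument list only when the variable is set. The function name render_args and the sample values are hypothetical.

```python
def render_args(base_args, header):
    """Return the task arguments, adding --data_dir when the header defines it."""
    args = list(base_args)  # copy so the caller's list is left untouched
    data_dir = header.get("data_dir")
    if data_dir:
        args.append("--data_dir=" + data_dir)
    return args


if __name__ == "__main__":
    # Step 1: the new variable lives in the header.
    header = {"data_dir": "/shared/data"}
    # Step 2: the flag is emitted into the args of each task.
    print(render_args(["--job_name=worker", "--task_index=0"], header))
```

Leaving the variable unset keeps the rendered arguments unchanged, so existing configs continue to work.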
