I build Tensorflow models and deploy them with Docker on Kubernetes during the day. There are two very painful pitfalls that can increase memory usage up to ~7x and slow down inference up to ~30x running on CPU. In the interest of saving others some time, here they are:
1. CPU issues 😣
When running tensorflow in docker, tensorflow thinks that it owns all the resources on the machine that docker is running on. For example, if you run a docker container on a kubernetes node with 128 cores, tensorflow thinks it can use all 128 cores. This causes massive slow downs – I’ve seen up to 30x slower depending on the network architecture.
When scheduling docker containers on kubernetes nodes, resources are limited per container using cgroups. Tensorflow in docker doesn’t care about cgroups and thinks it can use all resources on the host. For example, tensorflow will set
intra_op_parallelism_threads to 128 by default in the config on a 128 core node. In reality, you may have set the limit to 8 cores per container in the kubernetes deployment! Bad things ensue. 😖
How to fix?
Simply set the tensorflow config with the actual resources allocated to that container. To get CPU resources allocated in a docker container, you can use something like the following:
import math from pathlib import Path def get_cpu_quota_within_docker(): cpu_cores = None cfs_period = Path("/sys/fs/cgroup/cpu/cpu.cfs_period_us") cfs_quota = Path("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") if cfs_period.exists() and cfs_quota.exists(): # we are in a linux container with cpu quotas! with cfs_period.open('rb') as p, cfs_quota.open('rb') as q: p, q = int(p.read()), int(q.read()) # get the cores allocated by dividing the quota # in microseconds by the period in microseconds cpu_cores = math.ceil(q / p) if q > 0 and p > 0 else None return cpu_cores
To set the
tf.ConfigProto, you might do something like this:
import tensorflow as tf import multiprocessing cpu_cores = get_cpu_quota_within_docker() or multiprocessing.cpu_count() config = tf.ConfigProto( inter_op_parallelism_threads=cpu_cores, intra_op_parallelism_threads=cpu_cores)
And if you use keras, you could just set the default tensorflow session using the above
tf.ConfigProto like so:
import keras.backend as K K.set_session(tf.Session(config=config))
2. Memory issues
If you run tensorflow in docker with the default tensorflow config or the one above, you might notice your memory usage increasing on every inference call up to a certain point (for TF 1.10.1 on Ubuntu 16).
InceptionV3 should load in RAM using just ~600MB, but will then take up to ~4GB after ~100 inference calls depending on the hardware and
tf.ConfigProto. I filed a bug in this tensorflow issue, where you can find code to reproduce.
No clue, but I’m hoping to gain some insight once this tensorflow issue is resolved.
How to fix?
After many painful days, I discovered that memory stays flat if you set
Another minor detail
inter_op_parallelism_threads=1 sped up inference calls from 3s to 700ms for a saliency model I was working on, but slowed down InceptionV3 by 50ms on average. ¯\(ツ)/¯
In conclusion, good luck. I hope this post saves someone some time. And remember that tuning tensorflow is highly dependent on the hardware and config you are running on, especially within docker.
Thanks to the team I work on for all the support in debugging these issues!