
TensorFlow with Kubernetes/Docker in practice

2019-01-10 21:39:07 | Author: 向山 | Tags: usage, training, serving | Views: 2941

tensorflow

TensorFlow is Google's second-generation machine learning system, developed on the basis of DistBelief. Its name comes from its own operating principle: Tensor means an N-dimensional array, Flow means computation based on dataflow graphs, and TensorFlow describes tensors flowing from one end of the graph to the other. TensorFlow is a system that feeds complex data structures into artificial neural networks for analysis and processing.

TensorFlow can run on anything from a single smartphone to thousands of servers in a data center. This article mainly discusses one scheme for running TensorFlow on containers at scale.

As a deep learning framework, TensorFlow's handling of data can be divided into training, validation, testing, and serving. Generally speaking, training is used to fit the model; validation is mainly used to check whether the trained model is correct and whether it overfits; testing evaluates the trained model on held-out data to judge its accuracy; serving means using the finished model to provide a service. To keep things simple, this article groups all of these into just two stages: training and serving.

Training means taking a training program and a training dataset and producing a trained model. A trained model can be persisted as checkpoint files.

Validation, testing, and serving are all lumped into serving; its main flow is to apply an existing model to a dataset.
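
As a minimal illustration of the checkpoint mechanism mentioned above, the sketch below saves and then restores a trivial model with the TensorFlow 1.x tf.train.Saver API (the one-variable model and the paths are hypothetical, not the code used later in this article):

import tensorflow as tf

# A trivial "model": a single trainable variable (purely illustrative).
w = tf.Variable(0.0, name="weight")
train_op = tf.assign_add(w, 1.0)
saver = tf.train.Saver()

# Training side: run some steps, then persist the result as checkpoint files.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_op)
    saver.save(sess, "./checkpoint/model.ckpt", global_step=100)

# Serving side: restore the latest checkpoint and use the model.
with tf.Session() as sess:
    saver.restore(sess, tf.train.latest_checkpoint("./checkpoint"))
    print(sess.run(w))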

tensorflow training in kubernetes

For TensorFlow training, Kubernetes can provide support by creating multiple pods. Distributed TensorFlow works by designating parameter servers (ps) and worker servers.

The ps nodes are the parameter servers of the whole training cluster and hold the model's Variables; the workers are the nodes that compute the model's gradients, and the resulting gradient vectors are handed to the ps to update the model. In-graph and between-graph replication are the two alternatives, and both support synchronous as well as asynchronous training. In-graph means a single client builds the graph for the entire cluster and submits it to the cluster, while the other workers are only responsible for the gradient-computation tasks. Between-graph means each of the multiple workers in the cluster builds its own graph; because the workers run the same code, the graphs they build are identical, and the parameters are all stored on the same ps nodes, which guarantees that they train the same model. In this mode every worker can build its own graph and read its own training data, which suits big-data scenarios. The difference between synchronous and asynchronous training is that synchronous training blocks on every gradient update, waiting for the results from all workers, whereas asynchronous training never blocks and is therefore more efficient; in big-data, distributed settings asynchronous training is normally used. ---- TensorFlow深度学习
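
To make these roles concrete, here is a minimal between-graph sketch of how such a ps/worker cluster is typically wired up with the TensorFlow 1.x distributed API. The host:port values are the placeholder addresses used later in this article, and the one-variable model is purely illustrative; this is not the deep_recommend_system code:

import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_string("ps_hosts", "4.0.84.3:2222,4.0.84.3:2223", "comma-separated ps hosts")
flags.DEFINE_string("worker_hosts", "4.0.84.5:2222,4.0.84.6:2222", "comma-separated worker hosts")
flags.DEFINE_string("job_name", "worker", "'ps' or 'worker'")
flags.DEFINE_integer("task_index", 0, "index of this task within its job")
FLAGS = flags.FLAGS

def main(_):
    cluster = tf.train.ClusterSpec({
        "ps": FLAGS.ps_hosts.split(","),
        "worker": FLAGS.worker_hosts.split(","),
    })
    server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

    if FLAGS.job_name == "ps":
        # A ps task only stores Variables and serves parameter updates.
        server.join()
        return

    # Between-graph replication: every worker runs this same code and builds
    # an identical graph; replica_device_setter pins Variables to the ps tasks.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % FLAGS.task_index,
            cluster=cluster)):
        w = tf.Variable(0.0, name="weight")                 # stored on a ps task
        global_step = tf.Variable(0, trainable=False, name="global_step")
        train_op = tf.group(tf.assign_add(w, 1.0),          # computed on this worker
                            tf.assign_add(global_step, 1))

    # The chief (task 0) initializes variables and writes checkpoints.
    sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0),
                             logdir="./checkpoint",
                             global_step=global_step)
    with sv.managed_session(server.target) as sess:
        while not sv.should_stop() and sess.run(global_step) < 1000:
            sess.run(train_op)

if __name__ == "__main__":
    tf.app.run()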

I use a ReplicationController (rc) to create the ps and worker servers.

The image gcr.io/tensorflow/tensorflow:latest is the official image provided by TensorFlow and does its computation on the CPU. The GPU-based version is introduced further below.

[root@A01-R06-I184-22 yaml]# cat ps.yaml 
apiVersion: v1
kind: ReplicationController
metadata:
  name: tensorflow-ps-rc
spec:
  replicas: 2
  selector:
    name: tensorflow-ps
  template:
    metadata:
      labels:
        name: tensorflow-ps
        role: ps
    spec:
      containers:
      - name: ps
        image: gcr.io/tensorflow/tensorflow:latest
        ports:
        - containerPort: 2222
[root@A01-R06-I184-22 yaml]# cat worker.yaml 
apiVersion: v1
kind: ReplicationController
metadata:
  name: tensorflow-worker-rc
spec:
  replicas: 2
  selector:
    name: tensorflow-worker
  template:
    metadata:
      labels:
        name: tensorflow-worker
        role: worker
    spec:
      containers:
      - name: worker
        image: gcr.io/tensorflow/tensorflow:latest
        ports:
        - containerPort: 2222

Then create one service for the ps servers and one for the workers.

[root@A01-R06-I184-22 yaml]# cat ps-srv.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    name: tensorflow-ps
    role: service
  name: tensorflow-ps-service
spec:
  ports:
  - port: 2222
    targetPort: 2222
  selector:
    name: tensorflow-ps
[root@A01-R06-I184-22 yaml]# cat worker-srv.yaml
apiVersion: v1
kind: Service
metadata:
  labels:
    name: tensorflow-worker
    role: service
  name: tensorflow-wk-service
spec:
  ports:
  - port: 2222
    targetPort: 2222
  selector:
    name: tensorflow-worker

We can look at the services to find the IPs of the corresponding containers.

[root@A01-R06-I184-22 yaml]# kubectl describe service tensorflow-ps-service 
Name: tensorflow-ps-service
Namespace: default
Labels: name=tensorflow-ps,role=service
Selector: name=tensorflow-ps
Type: ClusterIP
IP: 10.254.170.61
Port: <unset> 2222/TCP
Endpoints: 4.0.84.3:2222,4.0.84.4:2222
Session Affinity: None
No events.
[root@A01-R06-I184-22 yaml]# kubectl describe service tensorflow-wk-service 
Name: tensorflow-wk-service
Namespace: default
Labels: name=tensorflow-worker,role=service
Selector: name=tensorflow-worker
Type: ClusterIP
IP: 10.254.70.9
Port: <unset> 2222/TCP
Endpoints: 4.0.84.5:2222,4.0.84.6:2222
Session Affinity: None
No events.

Here I use deep_recommend_system for the distributed experiment.

First download the deep_recommend_system code inside the pod.

curl https://codeload.github.com/tobegit3hub/deep_recommend_system/zip/master -o drs.zip
unzip drs.zip
cd deep_recommend_system-master/distributed/

In one of the ps containers (4.0.84.3), start the ps server tasks:

root@tensorflow-ps-rc-b5d6g:/notebooks/deep_recommend_system-master/distributed# nohup python cancer_classifier.py --ps_hosts=4.0.84.3:2222,4.0.84.3:2223 --worker_hosts=4.0.84.5:2222,4.0.84.6:2222 --job_name=ps --task_index=0 > log1 &
[1] 502
root@tensorflow-ps-rc-b5d6g:/notebooks/deep_recommend_system-master/distributed# nohup: ignoring input and redirecting stderr to stdout
root@tensorflow-ps-rc-b5d6g:/notebooks/deep_recommend_system-master/distributed# nohup python cancer_classifier.py --ps_hosts=4.0.84.3:2222,4.0.84.4:2223 --worker_hosts=4.0.84.5:2222,4.0.84.6:2222 --job_name=ps --task_index=1 > log2 &
[2] 603
root@tensorflow-ps-rc-b5d6g:/notebooks/deep_recommend_system-master/distributed# nohup: ignoring input and redirecting stderr to stdout

Here I tried to use two separate pods as the ps servers, but it kept failing with a core dump. Similar errors have been reported upstream without a resolution; my guess is that the cause is some device being reused (both pods were on the same host). Using a single pod to run both ps servers works fine.

Run the following in each of the two worker containers:

root@tensorflow-worker-rc-vznvt:/notebooks/deep_recommend_system-master/distributed# nohup python cancer_classifier.py --ps_hosts=4.0.84.3:2222,4.0.84.3:2223 --worker_hosts=4.0.84.5:2222,4.0.84.6:2222 --job_name=worker --task_index=0 > log &
***********************
root@tensorflow-worker-rc-cpnt7:/notebooks/deep_recommend_system-master/distributed# nohup python cancer_classifier.py --ps_hosts=4.0.84.3:2222,4.0.84.3:2223 --worker_hosts=4.0.84.5:2222,4.0.84.6:2222 --job_name=worker --task_index=1 > log &

Afterwards, the intermediate results of the trained model can be inspected in the checkpoint directory on the worker server.

root@tensorflow-worker-rc-vznvt:/notebooks/deep_recommend_system-master/distributed# ll checkpoint/
total 840
drwxr-xr-x 2 root root 4096 Oct 10 15:45 ./
drwxr-xr-x 3 root root 76 Oct 10 15:18 ../
-rw-r--r-- 1 root root 0 Sep 23 14:27 .gitkeeper
-rw-r--r-- 1 root root 270 Oct 10 15:45 checkpoint
-rw-r--r-- 1 root root 86469 Oct 10 15:45 events.out.tfevents.1476113854.tensorflow-worker-rc-vznvt
-rw-r--r-- 1 root root 248875 Oct 10 15:37 graph.pbtxt
-rw-r--r-- 1 root root 2229 Oct 10 15:42 model.ckpt-1172
-rw-r--r-- 1 root root 94464 Oct 10 15:42 model.ckpt-1172.meta
-rw-r--r-- 1 root root 2229 Oct 10 15:43 model.ckpt-1422
-rw-r--r-- 1 root root 94464 Oct 10 15:43 model.ckpt-1422.meta
-rw-r--r-- 1 root root 2229 Oct 10 15:44 model.ckpt-1670
-rw-r--r-- 1 root root 94464 Oct 10 15:44 model.ckpt-1670.meta
-rw-r--r-- 1 root root 2229 Oct 10 15:45 model.ckpt-1921
-rw-r--r-- 1 root root 94464 Oct 10 15:45 model.ckpt-1921.meta
-rw-r--r-- 1 root root 2229 Oct 10 15:41 model.ckpt-921
-rw-r--r-- 1 root root 94464 Oct 10 15:41 model.ckpt-921.meta
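
The model.ckpt-*/.meta pairs listed above can later be reloaded for inference without rebuilding the graph by hand. A small sketch with the TensorFlow 1.x checkpoint API (the directory path matches the listing above; everything else is illustrative):

import tensorflow as tf

# Find the newest checkpoint, e.g. ./checkpoint/model.ckpt-1921.
ckpt = tf.train.latest_checkpoint("./checkpoint")

# Rebuild the graph from the .meta file, then restore the trained Variables.
saver = tf.train.import_meta_graph(ckpt + ".meta")
with tf.Session() as sess:
    saver.restore(sess, ckpt)
    # Tensors of the restored graph can now be fetched by name for inference.
    print([v.name for v in tf.global_variables()])
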
tensorflow gpu support

tensorflow gpu in docker

Docker can provide GPU devices to containers. NVIDIA officially offers nvidia-docker, which replaces the docker command line with the nvidia-docker command line in order to use GPUs.

nvidia-docker run -it -p 8888:8888 gcr.io/tensorflow/tensorflow:latest-gpu

This approach is fairly intrusive to docker, so NVIDIA also provides an alternative based on nvidia-docker-plugin. Its workflow is as follows:

First start nvidia-docker-plugin on the host:

[root@A01-R06-I184-22 nvidia-docker]# ./nvidia-docker-plugin 
./nvidia-docker-plugin | 2016/10/10 00:01:12 Loading NVIDIA unified memory
./nvidia-docker-plugin | 2016/10/10 00:01:12 Loading NVIDIA management library
./nvidia-docker-plugin | 2016/10/10 00:01:17 Discovering GPU devices
./nvidia-docker-plugin | 2016/10/10 00:01:18 Provisioning volumes at /var/lib/nvidia-docker/volumes
./nvidia-docker-plugin | 2016/10/10 00:01:18 Serving plugin API at /run/docker/plugins
./nvidia-docker-plugin | 2016/10/10 00:01:18 Serving remote API at localhost:3476

As you can see, nvidia-docker-plugin listens on port 3476. Then run docker run -ti `curl -s http://localhost:3476/v1.0/docker/cli` -p 8890:8888 gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash on the host to create a GPU-enabled TensorFlow container, and verify inside the container that import tensorflow works.

[root@A01-R06-I184-22 ~]# docker run -ti `curl -s http://localhost:3476/v1.0/docker/cli` -p 8890:8888 gcr.io/tensorflow/tensorflow:latest-gpu /bin/bash
root@7087e1f99062:/notebooks# python
Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
 

As shown, TensorFlow now loads correctly.

Here I again use deep_recommend_system for the test. First download the corresponding deep_recommend_system code in the pod.

curl https://codeload.github.com/tobegit3hub/deep_recommend_system/zip/master -o drs.zip
unzip drs.zip
cd deep_recommend_system-master/

Then run the computation on GPU 0 and GPU 1.

root@087e1f99062:/notebooks/deep_recommend_system-master# export CUDA_VISIBLE_DEVICES=0,1    # specify which GPU IDs to use
root@087e1f99062:/notebooks/deep_recommend_system-master# python cancer_classifier.py 
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Use the model: wide_and_deep
Use the optimizer: adagrad
Use the model: wide_and_deep
Use the model: wide_and_deep
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: Tesla K20c
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:02:00.0
Total memory: 4.9GiB
Free memory: 4.1GiB
W tensorflow/stream_executor/cuda/cuda_driver.cc:572] creating context when one is currently active; existing: x24402e0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 1 with properties: 
name: Tesla K20c
major: 3 minor: 5 memoryClockRate (GHz) 0.7055
pciBusID 0000:04:00.0
Total memory: 4.9GiB
Free memory: 4.1GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 1 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0: Y Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 1: Y Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) - (device: 0, name: Tesla K20c, pci bus id: 0000:02:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:1) - (device: 1, name: Tesla K20c, pci bus id: 0000:04:00.0)
[0:00:34.437041] Step: 100, loss: 2.97578692436, accuracy: 0.77734375, auc: 0.763736724854
[0:00:32.162310] Step: 200, loss: 1.81753754616, accuracy: 0.7890625, auc: 0.788772583008
[0:00:37.559177] Step: 300, loss: 1.26066374779, accuracy: 0.865234375, auc: 0.811861813068
[0:00:36.082163] Step: 400, loss: 0.920016527176, accuracy: 0.8359375, auc: 0.820605039597
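
The device-creation lines in the log above show TensorFlow picking up /gpu:0 and /gpu:1 on its own. Placement can also be controlled explicitly; a minimal sketch of manual device placement with the TensorFlow 1.x API (the ops are hypothetical, not the deep_recommend_system model):

import tensorflow as tf

# Pin one matmul to each of the two visible GPUs (CUDA_VISIBLE_DEVICES=0,1).
with tf.device("/gpu:0"):
    a = tf.random_normal([1000, 1000])
    b = tf.random_normal([1000, 1000])
    c0 = tf.matmul(a, b)

with tf.device("/gpu:1"):
    c1 = tf.matmul(a, b)

# log_device_placement prints where each op actually ran, similar to the
# "Creating TensorFlow device (/gpu:0)" lines above; allow_soft_placement
# falls back to another device when an op has no GPU kernel.
config = tf.ConfigProto(log_device_placement=True, allow_soft_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run([tf.reduce_sum(c0), tf.reduce_sum(c1)]))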

Likewise, I can check GPU utilization with nvidia-smi:

[root@A01-R06-I184-22 ~]# nvidia-smi 
Tue Oct 11 00:10:28 2016 
+------------------------------------------------------+ 
| NVIDIA-SMI 352.39 Driver Version: 352.39 | 
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20c Off | 0000:02:00.0 Off | 0 |
| 30% 26C P0 48W / 225W | 4540MiB / 4799MiB | 2% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20c Off | 0000:04:00.0 Off | 0 |
| 30% 31C P0 48W / 225W | 4499MiB / 4799MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K20c Off | 0000:83:00.0 Off | 0 |
| 30% 25C P8 26W / 225W | 11MiB / 4799MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K20c Off | 0000:84:00.0 Off | 0 |
| 30% 24C P8 25W / 225W | 11MiB / 4799MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 132460 C python 4524MiB |
| 1 132460 C python 4484MiB |
+-----------------------------------------------------------------------------+

nvidia-docker-plugin works by exposing an API:

[root@A01-R06-I184-22 ~]# curl -s http://localhost:3476/v1.0/docker/cli
--volume-driver=nvidia-docker --volume=nvidia_driver_352.39:/usr/local/nvidia:ro --device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/nvidia1 --device=/dev/nvidia2 --device=/dev/nvidia3

As you can see, the command curl -s http://localhost:3476/v1.0/docker/cli actually returns the extra arguments required for docker run. These include the part that maps the GPU devices into the container (--device=/dev/nvidiactl --device=/dev/nvidia-uvm --device=/dev/nvidia0 --device=/dev/nvidia1 --device=/dev/nvidia2 --device=/dev/nvidia3), as well as the part that mounts the nvidia_driver_352.39 volume into the container.

Next, let's take a look at nvidia_driver_352.39.

[root@A01-R06-I184-22 ~]# docker volume ls
DRIVER VOLUME NAME
nvidia-docker nvidia_driver_352.39
[root@A01-R06-I184-22 ~]# docker volume inspect nvidia_driver_352.39 
[
    {
        "Name": "nvidia_driver_352.39",
        "Driver": "nvidia-docker",
        "Mountpoint": "/var/lib/nvidia-docker/volumes/nvidia_driver/352.39"
    }
]

As you can see, this volume is really just a directory. Let's examine /var/lib/nvidia-docker/volumes/nvidia_driver/352.39/:

[root@A01-R06-I184-22 ~]# tree -L 3 /var/lib/nvidia-docker/volumes/nvidia_driver/352.39/
/var/lib/nvidia-docker/volumes/nvidia_driver/352.39/
├── bin
│   ├── nvidia-cuda-mps-control
│   ├── nvidia-cuda-mps-server
│   ├── nvidia-debugdump
│   ├── nvidia-persistenced
│   └── nvidia-smi
├── lib
│   ├── libcuda.so -> libcuda.so.352.39
│   ├── libcuda.so.1 -> libcuda.so.352.39
│   ├── libcuda.so.352.39
│   ├── libGL.so.1 -> libGL.so.352.39
│   ├── libGL.so.352.39
│   ├── libnvcuvid.so.1 -> libnvcuvid.so.352.39
│   ├── libnvcuvid.so.352.39
│   ├── libnvidia-compiler.so.352.39
│   ├── libnvidia-eglcore.so.352.39
│   ├── libnvidia-encode.so.1 -> libnvidia-encode.so.352.39
│   ├── libnvidia-encode.so.352.39
│   ├── libnvidia-fbc.so.1 -> libnvidia-fbc.so.352.39
│   ├── libnvidia-fbc.so.352.39
│   ├── libnvidia-glcore.so.352.39
│   ├── libnvidia-glsi.so.352.39
│   ├── libnvidia-ifr.so.1 -> libnvidia-ifr.so.352.39
│   ├── libnvidia-ifr.so.352.39
│   ├── libnvidia-ml.so.1 -> libnvidia-ml.so.352.39
│   ├── libnvidia-ml.so.352.39
│   ├── libnvidia-opencl.so.1 -> libnvidia-opencl.so.352.39
│   ├── libnvidia-opencl.so.352.39
│   ├── libvdpau_nvidia.so.1 -> libvdpau_nvidia.so.352.39
│   └── libvdpau_nvidia.so.352.39
└── lib64
    ├── libcuda.so -> libcuda.so.352.39
    ├── libcuda.so.1 -> libcuda.so.352.39
    ├── libcuda.so.352.39
    ├── libGL.so.1 -> libGL.so.352.39
    ├── libGL.so.352.39
    ├── libnvcuvid.so.1 -> libnvcuvid.so.352.39
    ├── libnvcuvid.so.352.39
    ├── libnvidia-compiler.so.352.39
    ├── libnvidia-eglcore.so.352.39
    ├── libnvidia-encode.so.1 -> libnvidia-encode.so.352.39
    ├── libnvidia-encode.so.352.39
    ├── libnvidia-fbc.so.1 -> libnvidia-fbc.so.352.39
    ├── libnvidia-fbc.so.352.39
    ├── libnvidia-glcore.so.352.39
    ├── libnvidia-glsi.so.352.39
    ├── libnvidia-ifr.so.1 -> libnvidia-ifr.so.352.39
    ├── libnvidia-ifr.so.352.39
    ├── libnvidia-ml.so.1 -> libnvidia-ml.so.352.39
    ├── libnvidia-ml.so.352.39
    ├── libnvidia-opencl.so.1 -> libnvidia-opencl.so.352.39
    ├── libnvidia-opencl.so.352.39
    ├── libnvidia-tls.so.352.39
    ├── libvdpau_nvidia.so.1 -> libvdpau_nvidia.so.352.39
    └── libvdpau_nvidia.so.352.39
3 directories, 52 files

This directory mainly contains the GPU driver libraries and a few essential executables. These files are in fact collected and copied from the host into this directory by nvidia-docker-plugin, so that they can be provided to containers and make it convenient for containers to use the GPU.

kubernetes and GPU

Kubernetes 1.3 has already introduced GPU scheduling support, but at the moment it is experimental.

tensorflow serving

Serving Inception Model with TensorFlow Serving and Kubernetes describes how to use TensorFlow Serving together with Kubernetes.

Its basic workflow is: starting from an already trained model, build an image, inception_serving, that can serve requests; then use that image to create an rc and set up a corresponding service.

$ kubectl get rc
CONTROLLER CONTAINER(S) IMAGE(S) SELECTOR REPLICAS AGE
inception-controller inception-container gcr.io/tensorflow-serving/inception worker=inception-pod 3 20s
$ kubectl get svc
NAME CLUSTER_IP EXTERNAL_IP PORT(S) SELECTOR AGE
inception-service 10.15.242.244 146.148.88.232 9000/TCP worker=inception-pod 3m
$ kubectl describe svc inception-service
Name: inception-service
Namespace: default
Labels: <none>
Selector: worker=inception-pod
Type: LoadBalancer
IP: 10.15.242.244
LoadBalancer Ingress: 146.148.88.232
Port: <unnamed> 9000/TCP
NodePort: <unnamed> 32006/TCP
Endpoints: 10.12.2.4:9000,10.12.4.4:9000,10.12.4.5:9000
Session Affinity: None
Events:
 FirstSeen LastSeen Count From SubobjectPath Reason Message
 ───────── ──────── ───── ──── ───────────── ────── ───────
 4m 3m 2 {service-controller } CreatingLoadBalancer Creating load balancer
 3m 2m 2 {service-controller } CreatedLoadBalancer Created load balancer

User requests access the service directly through the EXTERNAL_IP (146.148.88.232:9000). When a request arrives, Kubernetes dispatches it to one of the pods at 10.12.2.4:9000, 10.12.4.4:9000, or 10.12.4.5:9000; that pod does the actual serving and returns the result.
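
For completeness, a client talks to that endpoint over gRPC using TensorFlow Serving's Predict API. A minimal sketch, assuming the tensorflow-serving-api Python bindings of that era (the model name "inception" follows the tutorial; the test image and timeout are illustrative):

from grpc.beta import implementations
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2

# The LoadBalancer endpoint exposed by inception-service.
host, port = "146.148.88.232", 9000
channel = implementations.insecure_channel(host, port)
stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

with open("cat.jpg", "rb") as f:   # hypothetical test image
    image_data = f.read()

request = predict_pb2.PredictRequest()
request.model_spec.name = "inception"
request.inputs["images"].CopyFrom(
    tf.contrib.util.make_tensor_proto(image_data, shape=[1]))

result = stub.Predict(request, 10.0)   # 10-second timeout
print(result)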

Essentially, this process is not much different from serving a web service (such as a Tomcat service), and Kubernetes supports it well.

References:
Saving and Restoring
Serving Inception Model with TensorFlow Serving and Kubernetes
TensorFlow深度学习
deep_recommend_system

 

http://www.cnblogs.com/xuxinkun/p/5983633.html
