GeePS: Scalable deep learning on distributed GPUs with a GPU-specialized parameter server
Notes on a paper:
GeePS is a smart data manager (data migration/movement coordinator) for a very large distributed deep learning training on GPUs. The paper addressed a real need and showed how a problem of distributed learning with GPUs can be solved.
Three primary specializations to a parameter server to enable efficient support of parallel ML applications running on distributed GPUs:
Explicit use of GPU memory for the parameter cache (to increase performance): moving the parameter cache into the GPU memory enables the parameter server client library to perform the data movement steps (between CPU and GPU memory, the updates from the local GPU must still be moved to CPU memory to be sent to other machines, the updates from other machines must still be moved from CPU memory to GPU memory). The client application parameter server executed the data movements in the background, overlapping them with GPU computing activity. In short: use GPU memory as a cache to keep actively used data and store the remaining data in CPU memory.
Batch-based parameter access methods (to increase performance) - which is very suitable for the SIMD style processing in GPU (Single Instruction on Multiple Data at once). They build indexes for faster gather-scatter operations to/from GPU registers - for the model parameters. This makes the parameter server accesses much more efficient for GPU-based training.
Parameter server management of GPU memory on behalf of the application (this expands the range of problem sizes that can be addressed). It swaps the data from the (not really that) limited 5 GB or more GPU memory to the (much larger) CPU memory. The GPU parameter server (GeePS) manages the data buffers.
Asynchronous training has been shown to provide significantly faster convergence rates in the data-parallel CPU-based model training systems - previous CPU based parameter servers (for example, in the laster paper: Parameter server: Scaling distributed machine learning with the parameter server). However, it was shown in this paper that it does not hold with GPU-based learning. First of all, the stall time for BSP (Bulk Synchronous Parallel) is negligible so the slight training throughput speedups achieved with an asynchronous training is not sufficient to overcome the reduced training quality per image processed - thus, using BSP leads to much faster convergence. Second, the theoretical conditions for faster convergence of asynchronous training apply to convex optimization but not to the non-convex optimization, which is used in deep neural networks. There was a big problem to solve, posed by Chilimbi: “Either the GPU must constantly stall while waiting for model parameter updates or the models will likely diverge due to insufficient synchronization”. GeePS was able to reduce the stalls to the minimum (so that they are only slightly worse than for the asynchronous training). Using GeePS, less than 8% of the GPU’s time is lost to stalls (e.g. for communication, synchronization, and data movement), as compared to 65% when using an efficient CPU-based parameter server implementation.
Interesting setup to help distributed the update of parameters in time and between many machines (this is a default configuration in GeePS): parameter data of each layer is stored in a distinct table, the parameter updates of a layer can be sent to the server and propagated to other workers (in the background) before the computation of other layers finish. This is about decoupling the parameter data of different layers. A less efficient setup would be to store all an application’s parameter data in a single table (the parameter updates of all layers would be sent to the server together as a whole batch).
The results are impressive, as expected after harnessing many GPUs. For example, GeePS achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
Resource division: this is a good division of resources: the CPU part of a node is for the parameter server while the GPU part of the node is used to do the training of a machine learning model. The GPU memory is the primary memory for the training data (mini-batches) and currently used parameters, the CPU memory is used as the parameter server memory but also as a bigger storage and swapping space for the data that does not fit into the GPU memory.
Avoiding data copying by exchanging pointers to GPU buffers. “The parameter server uses pre-allocated GPU buffers to pass data to an application, rather than copying the parameter data to application provided buffers. When an application wants to update parameter values, it also does so in GPU allocated buffers. The application can also store local non-parameter data (e.g intermediate states) in the parameter server.”
Very nice pinning of “pages” that are needed in both GPU and CPU.
They do not try to mitigate or even address in any way the problem of stragglers.
The main point is that the training of deep nets is effective on GPUs when model, intermediate state and input mini-batch fit into the GPU memory. All the models that they used in their experiments fit all the required data into the GPU 5 GB memory. When the GPU memory limits the scale, the common solution is to use smaller mini-batch sizes to let everything fit in GPU memory. However, there are two problems with small batches: 1) longer training time - the more mini-batches the more time spent on reading and updating the parameter data locally (with larger mini-batches it is amortized over time) 2) the GPU computation is more efficient with a larger mini-batch size.
They don’t address the problem of resilience. The usage of write-back cache implies that the parameters are kept only in cache and if one of the nodes crushes then we can loose part of the model parameters - it is not clear if the model can still converge but with only 4 param servers, it might be rather impossible. Thus, the system focuses primarily on performance without addressing any issues of lack of robust execution and node failures.
It can handle models as large as 20 GB while currently a 13 GB or more in a GPU is a common size. Thus, the system will be obsolete very soon if not expanded to cater to larger models (but is there such a need?).
There is a missing arrow in the graphs between the “Parameter server shard 0” and “Staging memory for parameter cache”. If the worker and parameter server processes are collocated, the exchange of updated parameters does not have to go through the network.
bulk synchronous parallel (BSP) involves three main steps: 1) local computation on many machines 2) communication between the machine 3) reaching a barrier and global step to the next task. In other words, the system includes: 1) components capable of processing and/or local memory transactions (i.e., processors) 2) a network that routes messages between pairs of such components, and 3) a hardware facility that allows for the synchronisation of all or a subset of components.
DeepImage - a specialized system with distributed GPUs connected via RDMA without any involvement of CPUs.
Krizhevsky - did he come up with the idea of dividing the model into many GPUs by placing each layer on a separate GPU? but this would require the expensive exchange of the intermediate data/state for the forward and backward passes.
Double buffering is used by default to maximize overlapping of data movement with computation. I observed a very bad behavior in many DBMSs for data loading/migration/export because of explicit lack of simultaneous computation (parsing & deserialization) and data movement (fetching data from disk for loading or moving the parsed/loaded data from the memory to a disk storage). The data movement and computation are not interleaved in many cases and we have to wait with computation until, for example, the local buffers are dumped to disk.
Most layers have little or no parameter data, and most of the memory is consumed by the intermediate states for neuron activations and error terms. (section 5.3 - artificially shrinking available GPU memory).
Gather - fill the SIMD register with values from different memory locations. Similarly, scatter - write the values in the output SIMD register to different memory locations. Gather-scatter is a type of memory addressing that often arises when addressing vectors in sparse linear algebra operations. It is the vector-equivalent of register indirect addressing, with gather involving indexed reads and scatter indexed writes. Vector processors (and some SIMD units in CPUs) have hardware support for gather-scatter operations, providing instructions such as Load Vector Indexed for gather and Store Vector Indexed for scatter.
GeePS is a parameter server supporting data-parallel model training. In data parallel training, the input data is partitioned among workers on different machines, that collectively update shared model parameters. These parameters themselves may be sharded across machines. “This avoids the excessive communication delays that would arise in model-parallel approaches, in which the model parameters are partitioned among the workers on different machines.”
In the basic parameter server architecture, all state shared among application workers (i.e. the model parameters being learned) is kept in distributed shared memory implemented as a specialized key-value stare called a “parameter server”. An ML application’s workers process their assigned input data and use simple Read and Update methods to fetch or apply a delta to parameter values, leaving the communication and consistency issues to the parameter server.
Client-side caches are also used to serve most operations locally. Many systems include a Clock method to identify a point when a worker’s cached updates should be pushed back to the shared key-value store and its local cache state should be refreshed.
While logically the parameter server is separate to the worker machines, in practice the server-side parameter server state is commonly sharded across the same machines as the worker state. This is especially important for GPU-based ML execution, since the CPU cores on the worker machines are otherwise underused.
When we train neural networks (forward and backward passes): each time only data of two layers are used.
Interface to GeePS managed buffer”
Read - Buffer “allocated” by GeePS, data copied to buffer.
PostRead - Buffer reclaimed
PreUpdate - Buffer “allocated” by GeePS
Update - Updates applied to data. Buffer reclaimed.
CPU PS (Parameter Server) has much overhead of transferring data between GPU/CPU memory in the foreground.
Implementation: GeePS is a C++ library. The ML application worker often runs in a single CPU thread that launches NVIDIA library calls or customized CUDA kernels to perform computations on GPUs, and it calls GeePS functions to access and release GeePS-managed data. The parameter data is sharded across all instances, and cached locally with periodic refresh (e.g. every clock for BSP). GeePS supports BSP, asynchrony, and the Staleness Synchronous Parallel (SSP) model.
Moreover, the GeePS can serve as a cache manager in general. The local non-parameter data (the intermediate state - data exchanged between the layers in the neural network) that probably do not fit into the GPU memory can be spilled to the CPU memory and managed the GeePS.
Parameter servers on CPUs & GPUs:
CPU PS - focused on machine learning in general, on the other hand, Gee PS - focused on deep learning.
deep learning is non-convex - synchrony is required
the CPU PS addresses convex problems - may work with asynchronous updates
mini-batch can cater to the big data - the data is naturally partitioned
the bottleneck of data movement between GPU memory and CPU memory
limited GPU memory - the huge problem that they focused on here
cache also the intermediate data (the data exchanged between neural network layers)
the models are big - billions of parameters
gobs of training data
distribution (scale out)
iterative computation (until we reach a convergence)
traditional algorithms/systems are single node
stalling for data
libraries for fast linear algebra
hypothetically - what-if mode for the query optimizer
build the indexes
run the transactions without any locks
write-ahead logging is exploiting the idea of simultaneous computation and data transfer
all the GPU memory goes to intermediate state (the data exchanged between layers)