By Han Zhou (hanzhou) and Shannon Williams (srw)
We want to implement large-scale graph mining on GPUs and compare the runtime against the versions available for the CPU.
This project is inspired by PEGASUS, which runs graph mining algorithms on top of Hadoop. The core idea of PEGASUS is to convert graph mining problems into matrix-vector multiplications, which can be expressed with SQL operations. Its primitive operation is therefore GIM-V (Generalized Iterated Matrix-Vector multiplication), which PEGASUS implements in Hadoop with several optimizations (see the PEGASUS paper for details).
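To make the GIM-V abstraction concrete, below is a minimal sequential sketch of one iteration, specialized to PageRank with the combine2 / combineAll / assign operations described in the PEGASUS paper. The Edge layout and function names are our own illustrative choices, not PEGASUS code.

```c
// Sequential reference for one GIM-V iteration, specialized to PageRank:
//   combine2(m_ij, v_j)   = c * m_ij * v_j
//   combineAll_i(x_1..x_n) = (1 - c) / n + sum_j x_j
//   assign(v_i, v_new)    = v_new
#include <stdlib.h>

typedef struct {
    int   src;     // row index i
    int   dst;     // column index j
    float weight;  // m_ij (for PageRank, e.g. 1 / out_degree of dst)
} Edge;

// One iteration: v_next = M x_G v_curr over a sparse edge list.
void gimv_iteration(const Edge *edges, size_t num_edges,
                    const float *v_curr, float *v_next,
                    int n, float c)
{
    // combineAll starts every component at the teleport term (1 - c) / n.
    for (int i = 0; i < n; i++)
        v_next[i] = (1.0f - c) / n;

    // combine2 plus the summation part of combineAll, one edge at a time.
    for (size_t e = 0; e < num_edges; e++)
        v_next[edges[e].src] += c * edges[e].weight * v_curr[edges[e].dst];

    // assign(v_i, v_new) = v_new: v_next already holds the new vector.
}
```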
Previously, we implemented a sequential version of GIM-V in SQL. We plan to use this as a baseline against which to measure the improvement of our GPU code.
The main challenge of this project is scaling the GPU implementation up to large datasets. GPU memory is limited, so we will need many disk reads to process the entire graph. We need to hide the latency of these reads efficiently, and also improve the compute time enough that, even with the additional latency, we still see a speedup.
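As a rough sketch of the kind of latency hiding we have in mind (our assumption, not a finished design): stream the edge list from disk in chunks, and use two pinned host buffers with two CUDA streams so that the host-to-device copy of one chunk overlaps with the kernel processing the previous chunk. The process_chunk_kernel below is a hypothetical placeholder for our eventual GIM-V kernel.

```cuda
// Double-buffering sketch, assuming the graph is stored on disk as
// fixed-size chunks of Edge records.
#include <cuda_runtime.h>
#include <stdio.h>

#define CHUNK_EDGES (1 << 20)

typedef struct { int src; int dst; float weight; } Edge;

// Hypothetical per-chunk kernel: one thread per edge, accumulating
// combine2 results into the next vector.
__global__ void process_chunk_kernel(const Edge *chunk, size_t n_edges,
                                     const float *v_curr, float *v_next)
{
    size_t e = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (e < n_edges)
        atomicAdd(&v_next[chunk[e].src], chunk[e].weight * v_curr[chunk[e].dst]);
}

void run_out_of_core(FILE *graph_file, const float *d_v_curr, float *d_v_next)
{
    Edge *h_buf[2];          // pinned host buffers so copies can be asynchronous
    Edge *d_buf[2];
    cudaStream_t stream[2];
    for (int b = 0; b < 2; b++) {
        cudaMallocHost((void **)&h_buf[b], CHUNK_EDGES * sizeof(Edge));
        cudaMalloc((void **)&d_buf[b], CHUNK_EDGES * sizeof(Edge));
        cudaStreamCreate(&stream[b]);
    }

    int b = 0;
    for (;;) {
        // Don't overwrite h_buf[b] until its previous async copy has finished.
        cudaStreamSynchronize(stream[b]);
        size_t n_read = fread(h_buf[b], sizeof(Edge), CHUNK_EDGES, graph_file);
        if (n_read == 0) break;

        // Copy and kernel go on the same stream; they overlap with whatever
        // the other stream is still doing, and with the next disk read.
        cudaMemcpyAsync(d_buf[b], h_buf[b], n_read * sizeof(Edge),
                        cudaMemcpyHostToDevice, stream[b]);
        int threads = 256;
        int blocks  = (int)((n_read + threads - 1) / threads);
        process_chunk_kernel<<<blocks, threads, 0, stream[b]>>>(
            d_buf[b], n_read, d_v_curr, d_v_next);
        b ^= 1;
    }

    for (int i = 0; i < 2; i++) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
        cudaFreeHost(h_buf[i]);
        cudaFree(d_buf[i]);
    }
}
```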
Another challenge is scaling up to multiple GPUs. Communication between GPUs will be difficult to implement and will add further latency. A key part of our project will be analyzing these trade-offs.
PEGASUS is an open-source graph mining package, so we can download it directly from its website.
cuBLAS is provided by NVIDIA to support basic linear algebra on the GPU, and its operations are already well optimized. However, we do not yet know whether we can still use it once we implement a scalable version that handles graphs larger than GPU memory, so we will try both the library and kernels we develop ourselves.
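As a first sanity check of the library route, here is a minimal sketch of a dense y = A·x with cublasSgemv. The sizes are illustrative only; a real adjacency matrix is sparse and far larger than device memory, which is exactly why the library call may not carry over to the scalable version.

```c
// Minimal cuBLAS sketch: dense y = A * x in single precision.
// Build with: nvcc gemv_check.cu -lcublas   (the file name is arbitrary)
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1024;                        // hypothetical vertex count
    const size_t mat_bytes = (size_t)n * n * sizeof(float);
    const size_t vec_bytes = (size_t)n * sizeof(float);

    float *h_A = (float *)calloc((size_t)n * n, sizeof(float));
    float *h_x = (float *)malloc(vec_bytes);
    float *h_y = (float *)malloc(vec_bytes);
    for (int i = 0; i < n; i++) h_x[i] = 1.0f / n;   // e.g. initial PageRank vector

    float *d_A, *d_x, *d_y;
    cudaMalloc((void **)&d_A, mat_bytes);
    cudaMalloc((void **)&d_x, vec_bytes);
    cudaMalloc((void **)&d_y, vec_bytes);
    cudaMemcpy(d_A, h_A, mat_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_x, h_x, vec_bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS expects column-major storage, so CUBLAS_OP_T makes the call
    // behave as y = A * x for a row-major A.
    cublasSgemv(handle, CUBLAS_OP_T, n, n, &alpha, d_A, n, d_x, 1, &beta, d_y, 1);

    cudaMemcpy(h_y, d_y, vec_bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);

    cublasDestroy(handle);
    cudaFree(d_A); cudaFree(d_x); cudaFree(d_y);
    free(h_A); free(h_x); free(h_y);
    return 0;
}
```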
At our demo, we plan to present several plots comparing the runtimes of our three implementations on multiple input graphs and multiple graph mining algorithms. This should clearly show when each implementation is superior to the others.
We probably won’t show the programs running live, because it wouldn’t look very interesting and might take a while to finish.
We want to discover whether the improved compute time of GPUs is worth the overhead of moving the large quantities of data found in modern graphs.
We plan to run our code on the Latedays cluster, which has Tesla K40 GPUs. The K40 is a high-end GPU designed for scientific computing. Latedays nodes also have large memory and multi-core CPUs, making them well suited to parallelizing on the CPU as well. For consistency, we will also run the sequential SQL version on Latedays. PEGASUS, in contrast, runs on Hadoop, so we can simply run it on AWS.
Week 0: (March 30 - April 5)
Week 1: (April 6 - April 12)
Week 2: (April 13 - April 19)
Weeks 3 - 4: (April 20 - May 3)
Week 5: (May 3 - May 10)