Large Scale Graph Mining on GPU


Project maintained by hanzhou and srw

By Han Zhou (hanzhou) and Shannon Williams (srw)

Main Page

Summary

We want to implement large scale graph mining on GPUs and compare the runtime to the versions available for CPU.

Background

This project is inspired from PEGASUS, which runs graph mining algorithms on the top of Hadoop. The core idea of PEGASUS is converting graph mining problems to a matrix-vector multiplication and using sql operations to achieve the computation. Therefore the primitive operation is GIM-V (Generalized Iterated Matrix-Vector multiplication). Then it is implemented in Hadoop with some optimizations. (See: this paper for details)

Previously, we implemented a sequential version of GIM-V in SQL. We plan to use this as a baseline to measure the improvement of our code.

Challenge

The challenge of this project is scaling up GPU to work on a large dataset. Local memory is limited, so we need to do many disk reads to compute over the entire graph. We need to efficiently hide the latency of these reads, and also improve the compute time such that, even with the additional latency, we still see a speed-up.

Another challenge would be scaling up to use multiple GPUs. Communication between GPUs will be difficult to implement, and also add further latency. A key point of our project will be analyzing these trade offs.

Resources

PEGASUS is an open source graph mining program, so we can directly download it from their website.

cuBLAS is provided by Nvidia to support the basic linear algebra computing. And these operations should have been well optimized. But we don’t know whether we can still use them when we want to implement a scalable version, so we will try both this library and functions developed by ourselves.

Goals and Deliverables

PLAN TO ACHIEVE

HOPE TO ACHIEVE

DEMO

At our demo, we plan to present several graphs comparing the runtime of our 3 algorithms on multiple input graphs for multiple graph mining algorithms. This should clearly show when each algorithm is superior to the others.

We probably won’t show a demo of our programs actually running, because it wouldn’t look very interesting, and it might take a while to run.

WHAT WE HOPE TO LEARN

We want to discover if the improved compute time of GPUs is worth the possible overhead of moving large quantities of data, such as those found in modern day graphs.

Platform Choice

We plan to use the Latedays cluster to run our code, which will use Tesla K40 GPUs. The K40 are advanced GPUs used for scientific computing. Also, Latedays has a large memory and many multi-core CPUs making it ideal for parallelizing on the CPU as well. For consistency, we will also run the sequential SQL version on Latedays. In contrast, PEGASUS uses Hadoop cloud computing to run, so we can just run this with AWS.

Schedule

Week 0: (March 30 - April 5)

Week 1: (April 6 - April 12)

Week 2: (April 13 - April 19)

Week 3 - 4: (April 20 - May 3)

Week 5: (May 3 - May 10)

Having trouble with Pages? Check out the documentation at https://help.github.com/pages or contact support@github.com and we’ll help you sort it out.