UPDATE: AWS has since changed its P3 instances and their pricing. The figures quoted here are from March 2018, when the cluster was built. For an updated analysis, and some new insight, please see the latest post on this topic.
I’ve been working on a system for detecting lung cancer from CT scans for 3 years. Last fall I submitted a manuscript to a premier medical informatics journal detailing this research. In the second round of revisions, a reviewer questioned the size of my test dataset. I had been conducting 9-fold cross-validation on a 10-bin dataset, using the 10th bin as the test set. To appease the reviewer, I chose to run the 9-fold cross-validation 10 times, holding out a different bin each time, so that the entire dataset would serve as test data. However, for each cross-validation fold I had to train 2 separate models, one of which contained over 100 million parameters. So, this project was going to be very computationally expensive.
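To get a feel for the scale, here is a minimal sketch that enumerates the training runs implied by the scheme above. It assumes the standard reading of 9-fold cross-validation (one of the nine remaining bins is held out for validation in each fold and the other eight are used for training); the function name `nested_splits` is just illustrative and not from my actual scripts.

```python
N_BINS = 10          # the dataset is split into 10 bins
MODELS_PER_FOLD = 2  # two separate models are trained for each fold


def nested_splits(n_bins=N_BINS):
    """Yield (test_bin, val_bin, train_bins) for every outer/inner combination.

    Assumption: within each 9-fold cross-validation, one bin is used for
    validation and the remaining eight for training.
    """
    for test_bin in range(n_bins):
        remaining = [b for b in range(n_bins) if b != test_bin]
        for val_bin in remaining:
            train_bins = [b for b in remaining if b != val_bin]
            yield test_bin, val_bin, train_bins


splits = list(nested_splits())
total_runs = len(splits) * MODELS_PER_FOLD
print(f"{len(splits)} folds, {total_runs} training runs")
```

Under these assumptions that works out to 90 folds and 180 separate training runs.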
I have 2 workstations (w/ 4 1070s, 3 1080Tis and a P100) at the university that I initially began using for this, but they weren’t going to cut it alone. To finish the experiments and meet the deadline for the revise-and-resubmit, we were going to need a lot more compute. Doing this on AWS or another IaaS provider would have been very expensive. Ultimately, I determined the most efficient option was to build my own cluster at home.
The home cluster cost ~$11,000 to build while equivalent compute on AWS is ~$15,000 per month.
The resulting cluster is shown above. It’s not pretty, but it gets the job done. To complete the study, I used this cluster for a total of ~1,000 hours. Based on this estimate we can do a rough calculation of the cost of equivalent computational capacity on AWS. The AWS P3 instances are $3.06 per hour. My experience comparing P100s and 1080Tis matches this benchmark, so I will estimate that each 1080Ti is equivalent to 90% of a P3 instance’s compute. That puts each hour of 1080Ti use in my home cluster at roughly $3.06 × 0.90 ≈ $2.75. Using 8 of these GPUs for 1,000 hours each at $2.75 /hr gives a total of approximately $22,000. So, I saved roughly $11,000 on the revision alone by building my own cluster. And now I have the equivalent of roughly $15,000 per month of AWS compute in my home.
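The back-of-the-envelope arithmetic can be reproduced with a few lines of Python. The prices and hours are the ones quoted above; the monthly figure assumes all 8 GPUs run around the clock for a 30-day month, which is my reading of "per month" rather than a measured number.

```python
P3_HOURLY_USD = 3.06     # AWS P3 price per hour (March 2018)
RELATIVE_1080TI = 0.90   # estimated 1080Ti throughput relative to a P3 instance
N_GPUS = 8               # GPUs used in the home cluster
HOURS_USED = 1000        # approximate hours of use for the revision
BUILD_COST_USD = 11000   # approximate cost of building the cluster

# Value of one hour of 1080Ti time, expressed in AWS dollars.
gpu_hour_usd = P3_HOURLY_USD * RELATIVE_1080TI

# Cost of the same amount of compute on AWS for the whole revision.
aws_equivalent_usd = gpu_hour_usd * N_GPUS * HOURS_USED
savings_usd = aws_equivalent_usd - BUILD_COST_USD

# Assumption: continuous use of all GPUs, 24 hours a day for 30 days.
monthly_equivalent_usd = gpu_hour_usd * N_GPUS * 24 * 30

print(f"1080Ti hour:        ${gpu_hour_usd:.2f}")
print(f"AWS equivalent:     ${aws_equivalent_usd:,.0f}")
print(f"Savings:            ${savings_usd:,.0f}")
print(f"Monthly equivalent: ${monthly_equivalent_usd:,.0f}")
```

Without the intermediate rounding, the totals come out slightly above the rounded figures in the text (about $22,000 for the revision and a bit under $16,000 per month).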
There certainly are downsides to this alternative. Most notably, one must be proficient at Linux scripting in order to run a large number of jobs in parallel across machines connected by an unmanaged network switch. Then, of course, there is the fact that the cluster runs on minimal hardware*:
- Intel Core i5 6th gen 3.0 GHz processor
- 2TB RAID-10 primary disk
- 32GB DDR4 RAM
I ran into trouble trying to load the training data into memory. This was simply not possible when training 3 separate models on each machine. There was a small performance hit once I switched to loading training data from a directory, but it was just something that had to be dealt with.
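The general idea of loading from a directory instead of memory can be sketched as a generator that reads only one batch of files at a time. This is an illustration rather than my actual training code: the function name `iter_batches`, the `load_fn` hook, and the one-sample-per-file layout are all assumptions.

```python
from pathlib import Path


def iter_batches(data_dir, batch_size, load_fn=Path.read_bytes, pattern="*"):
    """Yield lists of loaded samples, reading at most `batch_size` files at a time.

    Assumptions: each file in `data_dir` holds one training sample, and
    `load_fn` turns a file path into a sample (raw bytes by default; swap in
    your own decoder for images or arrays).
    """
    files = sorted(p for p in Path(data_dir).glob(pattern) if p.is_file())
    for start in range(0, len(files), batch_size):
        # Only this slice of the dataset is held in memory at once.
        yield [load_fn(p) for p in files[start:start + batch_size]]


# Hypothetical usage (the directory name is a placeholder):
# for batch in iter_batches("train_data", batch_size=32):
#     train_step(batch)
```

Peak memory then scales with the batch size rather than the dataset size, at the cost of extra disk reads on every epoch.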
For the purposes of my project, i.e. the revision, the minimalist cluster was more than satisfactory. However, the project was rather limited. The data directories were relatively small (compared to something like ImageNet) and the work consisted only of straightforward supervised learning.
This summer I’m working on some DeepRL projects, and may work on some unsupervised tasks, too. I’ll try to update this or make a new post regarding the minimalist cluster’s performance on these other applications.
*All of the hardware was the cheapest possible option with the exception of the power supplies, which were all EVGA units with bronze or silver certification. All of the GPUs were purchased from eBay due to retailers being out of stock. Had the GPUs been purchased at their list price, the entire cluster would have cost less than $9,000.