Master of Science (MS)
Replica-Exchange (RE) methods represent a class of algorithms that involve a large number of loosely coupled ensembles and are used to understand physical phenomena ranging from protein folding dynamics to binding affinity calculations. We develop a framework for RE that supports different replica pairing and coordination mechanisms and that can use a wide range of production cyberinfrastructure concurrently. Additionally, our framework uses a flexible pilot-job implementation, which enables effective resource allocation for multiple replicas. We characterize the performance of two different RE algorithms, synchronous and asynchronous, at unprecedented scales on production distributed infrastructure (TeraGrid and LONI). The synchronous RE algorithm is implemented with a centralized master, while the asynchronous RE algorithm is implemented with both centralized and decentralized replica management schemes. We evaluate the performance of the different algorithms and implementations when we scale up the number of replicas (up to 256) on a single machine and when we scale out across 2 and 4 machines. The synchronous and asynchronous algorithms perform similarly when the number of replicas is small. As the number of replicas increases, however, the synchronization cost in synchronous RE increases the total time to completion. In centralized asynchronous RE, the cost of managing many replicas centrally also increases the time to completion, but not as much as in synchronous RE. Decentralized asynchronous RE scales much better with an increasing number of replicas. When scaled out across many machines, the performance of synchronous RE depends on whether the machines are homogeneous or heterogeneous; a heterogeneous infrastructure means increased synchronization costs. We also run tests to determine whether one of the algorithms is better suited to achieving more crosswalks and temperature mixing, that is, better sampling.
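To make the synchronous exchange step concrete, the following is a minimal Python sketch of standard temperature replica exchange. It assumes replicas differ only in temperature and that potential energies are in kcal/mol. The Metropolis acceptance rule is the textbook one. The function names `accept_swap` and `synchronous_exchange_sweep`, the even/odd neighbour pairing, and the Boltzmann-constant units are illustrative assumptions, not the framework's actual implementation.

```python
import math
import random

# Boltzmann constant in kcal/(mol*K); assumes energies are given in kcal/mol.
K_B = 0.0019872041


def accept_swap(energy_i, temp_i, energy_j, temp_j, rng=random):
    """Metropolis test for exchanging temperatures between replicas i and j.

    Accept with probability min(1, exp[(beta_i - beta_j) * (E_i - E_j)]).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    # Favourable swaps are always accepted; others with probability exp(delta).
    return delta >= 0 or rng.random() < math.exp(delta)


def synchronous_exchange_sweep(energies, temps, sweep, rng=random):
    """One synchronous exchange step, run only after ALL replicas finish an MD segment.

    temps[k] is the temperature currently held by replica k. Replicas on
    neighbouring temperature rungs are paired, alternating even/odd pairs
    between sweeps (a hypothetical pairing scheme chosen for illustration).
    Returns the new temperature assignment.
    """
    # Order replica indices by the temperature rung they currently occupy.
    order = sorted(range(len(temps)), key=lambda k: temps[k])
    new_temps = list(temps)
    start = sweep % 2
    # Pairs within one sweep are disjoint, so reading the old temps is safe.
    for r in range(start, len(order) - 1, 2):
        i, j = order[r], order[r + 1]
        if accept_swap(energies[i], temps[i], energies[j], temps[j], rng):
            new_temps[i], new_temps[j] = temps[j], temps[i]
    return new_temps
```

The sweep runs only after every replica has reported its energy. That global barrier is the synchronization cost that grows with the replica count; an asynchronous scheme would instead attempt swaps among whichever replicas happen to be idle.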
Document Availability at the Time of Submission
Release the entire work immediately for access worldwide.
Thota, Abhinav S., "Efficient replica-exchange across distributed production infrastructure" (2011). LSU Master's Theses. 1456.