Distributed systems are becoming increasingly important in day-to-day computing tasks. While building such systems is a well-understood process, measuring and monitoring them is not. This paper describes the difficulties inherent in measuring distributed systems and enumerates four goals for such measurement: longevity, flexibility, fault tolerance, and unintrusiveness. It then describes the Coda File System, a research system that has been deployed in a moderately sized user community for four years and actively measured for three and a half years. The architecture of its measurement framework is described in detail, with an eye toward examining how well it meets the four goals. The paper concludes with the lessons to be taken from this experience, both those that were foreseen and those that were learned along the way. In an effort to help teach these lessons, we have made the Coda source code, along with the measurement framework, freely available for any purpose.
bnoble@cs.cmu.edu