Murray Stokely

Alumni (MIDS 2023)

Focus

Currently using data science in industry to solve complex problems in distributed systems design, cloud computing, and disaster readiness for online services.

Biography

Academic Background

I completed a graduate degree in Computer Science at the University of Oxford in 2005.  My thesis was on the subject of Formal Software Verification.  Before that, I completed a dual-major B.S. in Mathematics and Computer Science and an M.S. in Mathematics at CSU East Bay in 2004, and I was named the Outstanding Senior of the Year in Mathematics as an undergraduate in 2003.  As part of my graduate studies in mathematics, I was awarded a grant from the National Science Foundation to study Mathematics in Moscow for a semester.  Since graduating, I have continued to pursue academic opportunities for professional development, taking a number of graduate statistics classes relevant to my work in industry in areas such as Stochastic Processes, Time Series Forecasting, Statistical Inference, and Regression.

Relevant Experience

Since completing my degree at Oxford and returning to work in Silicon Valley, I have worked on a number of interesting problems at the intersection of data science and computer science at Google, Apple, and Facebook, and I have published 7 peer-reviewed papers from this industrial research.  As part of this work, I have learned a great deal about mining unstructured data, utilizing distributed computing paradigms to process exabytes of data, and formulating rigorous quantitative problems out of a very dynamic business environment.  Some of my recent projects include:

  • Markov Modelling of Availability and Failures.  At Google, I founded and led the “Storage Analytics” research group.  One of our first projects characterized the availability properties of cloud storage systems, based on an extensive one-year study of Google’s main storage infrastructure, and developed statistical models that give insight into the impact of design choices such as data placement and replication strategies.  With these models, we compared data availability under a variety of system parameters given the real patterns of failures observed in our fleet.  I presented this work at OSDI, the premier distributed systems conference in the field, and it continues to be widely cited (over 600 citations).
     
  • Ensemble Forecasting of Search Traffic.  At Google, my team was tasked with generating executive-level weekly forecasts and insights of global search traffic and trends.  To accomplish this, we built an ensemble forecasting model that ran our time series of search query volume through a basket of different forecasting models (including ARIMA, Holt-Winters, Bayesian Structural Time Series, and others) and then took a trimmed mean of the forecasts to produce a final forecast more accurate than any individual model in the ensemble.  We also used bootstrap methods to generate realistic confidence intervals for our forecasts by replaying prior empirical observations.  As part of this work, I wrote a number of custom R packages, now widely used by the data science community at Google, that enable data scientists working in higher-level languages such as R to use Google’s distributed compute infrastructure for massive parallelism.
     
  • Human Rating and Labelling for Training ML Pipelines to Recognize Features from Satellite Images.  At Apple, my team built the Apple Rating Tool, a platform for crowdsourcing human judgements to improve Apple Maps, Siri, iTunes, and other online services.  We built RESTful web applications in Java, mobile iOS apps, Python and R client libraries, and more.  We worked closely with a large team of data scientists to sign off on regular code and data deployments for Apple Maps, train machine learning models, and make evaluation, metrics, log replay, and A/B testing pervasive across Apple Maps at scale.
     
  • Accelerating Migration of Ranking and Content Understanding Models to Custom AI Hardware.  At Facebook, I am interested in how we can continue to scale the size of our machine learning models for content understanding, filtering harmful content, and ranking relevant content.  In particular, I am looking at better understanding the limitations of distributed training versus scaling up with larger dedicated HPC-like clusters of machines and dedicated hardware accelerators.  Can we improve the abstractions provided by PyTorch and other frameworks to better utilize distributed training clusters?  Can NVM or other new storage technologies be leveraged for embedding tables and other data structures common in ML computations?
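As a sketch of the trimmed-mean ensemble technique described in the forecasting project above (the model names and numbers here are hypothetical, not actual Google forecasts), the combination step can be expressed in a few lines of Python:

```python
import numpy as np

def trimmed_mean_forecast(forecasts, trim=1):
    """Combine per-model forecasts with a trimmed mean.

    forecasts: 2-D array-like, rows = models, columns = horizon steps.
    trim: number of extreme forecasts to drop at each end, per step.
    """
    f = np.sort(np.asarray(forecasts, dtype=float), axis=0)
    return f[trim:f.shape[0] - trim].mean(axis=0)

# Hypothetical 4-week forecasts from three models (e.g. ARIMA,
# Holt-Winters, and a Bayesian structural time series model).
arima = [100.0, 102.0, 105.0, 107.0]
hw    = [ 98.0, 101.0, 104.0, 108.0]
bsts  = [130.0, 135.0, 140.0, 150.0]  # an outlier the trim discards

print(trimmed_mean_forecast([arima, hw, bsts]))  # -> [100. 102. 105. 108.]
```

The bootstrap intervals mentioned above could then be formed by resampling past forecast errors and replaying them around this point forecast.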

In each of these projects, I learned new statistical techniques that I have since been able to apply in other areas of my work.
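One hedged sketch of the kind of Markov availability model described in the first project (the parameters and independence assumptions here are illustrative, not the published Google model): suppose data is stored with r replicas, each up replica fails with probability p per time step, and each down replica recovers with probability q.

```python
import numpy as np
from math import comb

def transition_matrix(r, p, q):
    """Markov chain over states 0..r = number of up replicas.

    Each up replica fails with prob p per step; each down replica
    recovers with prob q. Replicas are assumed independent.
    """
    P = np.zeros((r + 1, r + 1))
    for up in range(r + 1):
        down = r - up
        for fail in range(up + 1):
            for rec in range(down + 1):
                prob = (comb(up, fail) * p**fail * (1 - p)**(up - fail)
                        * comb(down, rec) * q**rec * (1 - q)**(down - rec))
                P[up, up - fail + rec] += prob
    return P

def stationary(P):
    """Stationary distribution: eigenvector of P.T for eigenvalue 1."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmax(np.real(vals))])
    return v / v.sum()

P = transition_matrix(r=3, p=0.01, q=0.5)   # illustrative rates
pi = stationary(P)
availability = 1.0 - pi[0]   # unavailable only when all replicas are down
```

Richer versions of such models can capture correlated failures and alternative placement or replication strategies by changing the state space and transition probabilities.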

Selected Publications

  • Take me to your leader! Online Optimization of Distributed Storage Configurations, VLDB 2015.
  • Janus: Optimal Flash Provisioning for Cloud Storage Workloads, USENIX 2013.
  • Availability in Globally Distributed Storage Systems, OSDI 2010.
  • Using a market economy to provision compute resources across planet-wide clusters, IPDPS 2009.

Last updated: April 19, 2024