Speaker: Jonathan Hsieh / Tech Lead & Engineering Manager @ Cloudera
Talk title: The Rise of Open Source Data Platforms: An Insider’s View
Today, Big Data and open source data platforms are a multi-billion-dollar industry. New technologies, projects, and companies sprout up weekly to help people use data to make smarter data-backed decisions. It didn’t start this way though. I’ll share the story from an insider’s point of view and also present interesting trends about what’s next.
Over the past 10 years, I’ve been an engineer on several near real-time systems and streaming data systems. I’ll share some personal stories about the humble beginnings of Big Data at the Apache Software Foundation and at Cloudera. Massive shifts in hardware and economics, along with the growth of open source, have made today’s technologies possible. What started as a scale-out on-prem batch processing system has evolved into a set of maturing streaming and near real-time systems. What only the largest internet companies could deploy in the past, any enterprise or college student can use today. What was a small group of researchers and enthusiasts in the U.S. has become a community of companies with professional developers around the world.
The road to today’s capabilities was not completely smooth. New hardware and economic trends will continue to disrupt the industry. Open source projects will rise and fall over time, but those with the strongest communities will survive. The good news is that there is plenty of opportunity ahead, and it is open for all of you to join in! Recent trends such as maturing core systems allow us to shift the emphasis toward machine learning and data science. New hardware advancements such as persistent memory and a shift toward cloud-based deployments will enable larger scale and even lower latencies. Finally, and most importantly, the continued addition of contributors from around the world (especially Asia!) to open source projects and open-source-by-default companies will change who drives innovation.
Jonathan Hsieh (謝明心) was most recently the Tech Lead and Engineering Manager of the HBase team at Cloudera. He joined Cloudera in 2009 as one of its earliest engineers. While there, he founded and built a community around the streaming data ingestion project Apache Flume; became a committer and project management committee member on the near-real-time datastore Apache HBase; and became a member of the Apache Software Foundation. Prior to Cloudera, he was a graduate student at the University of Washington, where he did research on distributed near-real-time systems.
- Data Science in the Enterprise
- Building Deep Learning Pipelines on Apache Spark for ads optimization