Date: May 14, 2019

About Statscraft

This conference is all about making monitoring easier, more accessible and more productive.

Monitoring is crucial for detecting problems, optimizing performance, capacity planning, improving user experience and business impact... Yet in many companies monitoring is an afterthought, leading them to miss out on the value of the data they collect. We often hear that "monitoring is hard" - and it can be, unless we do something about it.

Agenda

*this conference is Kosher and all talks are in biblical Hebrew

09:00 - 09:30 Break

Gathering and signup

Mingling FTW

09:30 - 10:15 Talk

Monitoring time in a distributed database: a play in three acts

Shlomi Noach

Staff Software Engineer @ GitHub

Shlomi is a developer and database geek. He is an active MySQL community member, authors orchestrator, gh-ost, freno and other open source tools, and blogs at http://openark.org. He works at GitHub on the database infrastructure team solving high availability, reliability, enablement problems, running automation and testing. Shlomi is the recipient of MySQL Community Member of the Year, Oracle ACE (Alumni) & Oracle Technologist of the Year awards.

Summary

Monitoring time is tricky given its fluid nature. Doing so across distributed database hosts is trickier. Latency, probe intervals and clock synchronization all affect the metrics, and taking actions based on those metrics makes matters even more complex. How does one measure time? What is the baseline? What accuracy and tradeoffs can we expect? Can we use time itself to affect the outcome? At GitHub, we monitor time in our database topologies for throttling and consistent reads. We present our use case and share how we communicate metric information in our distributed company.

Slides

YouTube Video

10:15 - 11:00 Talk

How to think like an SRE

Itay Maman

Principal Engineer @ Wix

Addicted to dark chocolate and refactoring, Itay has been (and still is) coding, designing, debugging, architecting, and solving production outages for ages. He has recently joined Wix's infrastructure group working on Wix's internal build system. Earlier gigs include Testim.io (core data-platform that supports Testim's algorithm and backends), Google (a TL of Google Sites' infrastructure team), a lecturer at the Technion, and a few other places.

Summary

SREs (Site Reliability Engineers) are responsible for the reliability and availability of large-scale distributed systems at organizations such as Google, FB, Twitter, etc. Being a successful SRE requires developing an engineering perspective that takes into account various factors which "regular" software engineers are either unaware of or just don't rank as important enough. In this talk I will highlight several concrete points where the thinking of an SRE and the thinking of a software engineer are likely to differ, across various aspects of system evolution: architecture planning, monitoring planning, coding, and operating an incident.

Slides (PDF)

YouTube Video

11:00 - 11:30 Break

Break

Coffee, anyone?

11:30 - 12:00 Talk

...But what happens when DynamoDB explodes?

Erez Berkner

CEO & Co-founder @ Lumigo

Erez is the CEO & co-founder of Lumigo, a startup focusing on simplifying serverless application troubleshooting, where the entire backend is… 100% serverless. Prior to founding Lumigo, Erez was the R&D director of cloud products at Check Point, heading the company’s cloud strategy & execution.

Summary

2:32 am. PagerDuty wakes you up. DynamoDB is throttling. Should you wake up the team and fiercely charge to resolve the issue, or can it wait for tomorrow? Understanding the business impact and the affected users is the key to making this decision. Those data points are usually not easy to obtain, especially in highly distributed asynchronous architectures like serverless. In this session, we will share guidelines on what needs to be part of serverless application monitoring in order to answer those questions in a matter of minutes. The main operational questions when things go bad:
- What is the user functionality being affected?
- Which users were affected and how?
- What is the root cause of these issues?
Getting a good night’s sleep is within arm’s reach...

Slides

YouTube Video

11:30 - 12:00 Talk

Visualization in Serverless Applications

Ran Ribenzaft

Co-Founder & CTO @ Epsagon

I’m a passionate developer with vast experience in networking, infrastructure, and cyber-security. Constantly chasing new technologies, the current one being Serverless. Love sharing open source tools to make everyone’s lives easier :) In my current role, I’m the co-founder and CTO at Epsagon - monitoring for serverless applications. I love swimming, traveling around the world, and taking breathtaking pictures.

Summary

Modern distributed applications are often seen as a graph of nodes and edges, where each node represents a micro-service, a function, or an API service. Serverless applications take this to the extreme, as each component is smaller than ever. Visualization can help in several key aspects of designing and operating such applications:
- Designing the application and onboarding new team members
- Debugging while developing
- Troubleshooting complex issues in production
- Identifying performance and cost bottlenecks

Slides (PDF)

YouTube Video

12:00 - 12:30 Talk

Monitoria: A monitoring democracy

Yaron Idan

DevOps @ Soluto

Been chopping away at various types of software for 15 years now. After spending ~10 years as a DBA I shifted to a DevOps-oriented approach and started spending my energy on mastering the toolchain. Currently an essential part of Soluto’s DevOps team, leading the effort of shifting our production workload to Kubernetes, along with many other daily challenges.

Summary

Monitoring is important - but can also be complicated. At Soluto, we put a lot of effort into this, so we can know that our users get the best experience. We want to share how we transformed monitoring from a one-man job into something every employee cares about and actively participates in.

Slides (PDF)

YouTube Video

12:00 - 12:30 Talk

Logging in the cloud: machines first, humans come second

Itiel Shwartz

Lead production engineer @ Rookout

Originally a backend developer, now turned DevOps. Started working at eBay but felt it was too big. Joined a 50-person startup (Forter), but it got big again. So now I’m the first developer at Rookout - hope to make my company big :)

Summary

Our system complexity is growing like crazy, but what about our logging? While writing new code, we should think about the (poor) developer trying to debug it in production. Sadly, we feel like our code is great - so no need to log it :) Come and find out what’s broken in logging design, and how to fix it.

Slides (PDF)

YouTube Video

12:30 - 13:30 Break

Lunch

Yay! Food!

13:30 - 14:00 Talk

Monitoring lessons from Waze SRE team

Yonit Gruber-Hazani

System Operations Engineer, Waze SRE Team

Yonit is a System Operations Engineer at Google and works on the Waze infrastructure team in Tel Aviv, which manages and monitors thousands of servers in multiple cloud providers. Yonit has more than 20 years of experience in the DevOps world, integrating open source apps into startups and supporting various web stacks.

Summary

Our monitoring concepts are changing in a world of thousands of servers, from individual pets to a holistic view of a system as a collection of critical user journeys. How do you enable developers to self-monitor their hundreds of microservices, with hundreds of deployments and configuration changes per day? Do we really need to monitor every server's CPU? How many DevOps engineers do you need to wake in the middle of the night for every disk that dies? What lessons did we learn while trying to monitor our ever-changing system?

Slides

YouTube Video

13:30 - 14:00 Talk

Exploiting monitoring for fun and profit

Omer Levi Hevroni

DevSecOps Engineer @ Soluto

I’ve been coding since 4th grade, when my dad taught me BASIC, and haven’t looked back since. I’m an AppSec/DevSecOps enthusiast, always curious about integrating more hacking tools into the CI/CD pipeline. I’m always looking for new and interesting ways to increase security awareness across the entire R&D – developers, product, and UX. I strongly believe in OWASP and am a proud member. Besides that, I’m also the OWASP Glue project leader. I am an open source addict, using OSS heavily and contributing back. Today I’m working at Soluto. And most important – I’m a proud father to two beloved children, and happily married.

Summary

We all use monitoring - after all, we all want to ensure that our applications are working as expected. But can hackers use it to exploit our application? Join me to explore different ways to exploit application monitoring tools, and what mitigations we can use - including live demos!

Slides

YouTube Video

14:00 - 14:30 Talk

Monitoring done wrong

Avishai Ish-Shalom

Engineer in Residence @ Aleph VC

Software janitor, data plumber, infrastructure handyman. Will code for scotch.

Summary

We've all heard numerous "awesome monitoring @ X" talks; Boring! Join me in exploring monitoring design principles through various fails - because we can learn sooo much more by analyzing cases where monitoring was done wrong :-)

Slides

YouTube Video

14:00 - 14:30 Talk

Spanning services - The practical guide to Distributed Tracing

Yair Galler

Backend Engineering Team Lead @ Next Insurance

Yair Galler is an engineering team lead at Next Insurance’s backend group. He has over 15 years of full-stack software engineering experience tackling everything from devops and backend challenges to the latest frontend frameworks.

Summary

What is distributed tracing and why should I care? This talk will introduce tracing concepts, help you get started with it and share practical dos and don’ts.

Slides (PDF)

YouTube Video

14:30 - 15:30 Break

Happy hour and games

15:30 - 16:00 Ignite

Zen of production incident management

Dan Yelovitch

Senior SRE @ ZipRecruiter

Summary

Zen of production incident management, or, to put it simply, how to handle production incidents at any scale.

Slides (PDF)

YouTube Video

15:30 - 16:00 Ignite

Monitoring Driven Debugging

Anna Tsibulskaya

Systems engineer @ Wix

Summary

A different perspective on debugging issues in production

Slides (PDF)

YouTube Video

15:30 - 16:00 Ignite

Distributed HPC monitoring

Salo Shp

Solutions architect @ Tikal

Slides

YouTube Video

15:30 - 16:00 Ignite

Actionable alerts

Savva Khalaman

Backend developer @ Wix

Slides (PDF)

YouTube Video

15:30 - 16:00 Ignite

Defining KPIs Like a Pro

Elena Levi

Product Manager @ AppsFlyer

Slides (PDF)

YouTube Video

16:00 - 16:30 Talk

Sensory Friendly Monitoring: Keeping the Noise Down

Quintessence Anx

Developer Advocate @ Logz.io

Quintessence has worked in the IT community for over 10 years, including as a database administrator and a DevOps / Cloud / Infrastructure engineer. She was a core contributor to Stark & Wayne's SHIELD project, which adds backup functionality to Cloud Foundry, as well as a technical reviewer for Learning Go Programming, published by Packt Publishing. Currently she is the US Developer Advocate for Logz.io, focusing on community engagement related to DevOps, with an emphasis on monitoring and observability. Outside of work she is a leader and co-founder of Inclusive Tech Buffalo, which helps underrepresented minorities in tech launch careers in development in the Buffalo region.

Summary

The ability to monitor infrastructure has been exploding, with new tools on the market and new integrations so the tools can speak to one another. This leads to even more tools and to a hypothetically very loud monitoring environment, with various members of the engineering team finding themselves muting channels, individual alerts, or even alert sources so they can focus long enough to complete other tasks. There has to be a better way - a way to configure comprehensive alerts that send out notifications with the appropriate level of urgency to the appropriate persons at the appropriate time. And in fact there is: during this talk I’ll be walking through different alert patterns and discussing what we need to know, who needs to know it, and how soon and how often they need to know.

Slides

YouTube Video

Sponsors

Organizing Committee

This conference is a community effort by and for people who do monitoring daily and care about monitoring. The organizing committee are all volunteers and sponsorships cover the direct costs of the conference.

Sharone Raveh-Zitzman

Eliran Ben-Zikri

Avishai Ish-Shalom