About Statscraft
This conference is all about making monitoring easier, more accessible and more productive.
Monitoring is crucial for detecting problems, optimizing performance, capacity planning, improving user experience and business impact... Yet in many companies monitoring is an afterthought, leading them to miss out on the value of the data they collect. We often hear that "monitoring is hard" - and it can be, unless we do something about it.
Agenda
*this conference is Kosher and all talks are in biblical Hebrew
Gathering and signup
Monitoring time in a distributed database: a play in three acts
Summary
Monitoring time is tricky given its fluid nature. Doing so across distributed database hosts is trickier. Latency, probe intervals, and clock synchronization all affect the metrics, and taking actions based on those metrics makes matters even more complex. How does one measure time? What is the baseline? What accuracy and tradeoffs can we expect? Can we use time itself to affect the outcome? At GitHub, we monitor time in our database topologies for throttling and for consistent reads. We present our use case and share how we communicate metric information in our distributed company.
Slides
YouTube Video
How to think like an SRE
Itay Maman
Principal Engineer @ Wix
Addicted to dark chocolate and refactoring, Itay has been (and still is) coding, designing, debugging, architecting, and solving production outages for ages. He has recently joined Wix's infrastructure group working on Wix's internal build system. Earlier gigs include Testim.io (core data-platform that supports Testim's algorithm and backends), Google (a TL of Google Sites' infrastructure team), a lecturer at the Technion, and a few other places.
Summary
SREs (Site Reliability Engineers) are responsible for the reliability and availability of large-scale distributed systems at organizations such as Google, FB, Twitter, etc. Being a successful SRE requires developing an engineering perspective that takes into account various factors which "regular" software engineers are either unaware of, or just don't rank as important enough. In this talk I will highlight several concrete points where the thinking of an SRE and the thinking of a software engineer are likely to differ, across various aspects of system evolution: architecture planning, monitoring planning, coding, and operating an incident.
Slides (PDF)
YouTube Video
Break
...But what happens when DynamoDB explodes?
Erez Berkner
CEO & Co-founder @ Lumigo
Erez is the CEO & co-founder of Lumigo, a startup focusing on simplifying serverless applications troubleshooting, where the entire backend is… 100% serverless. Prior to founding Lumigo, Erez was the R&D director of cloud products at Check Point, heading the company’s cloud strategy & execution.
Summary
2:32 am. PagerDuty wakes you up. DynamoDB is throttling. Should you wake up the team and fiercely charge to resolve the issue, or can it wait for tomorrow? Understanding the business impact and the affected users is the key to making this decision. Those data points are usually not easy to obtain, especially in highly distributed asynchronous architectures like serverless. In this session, we will share guidelines on what needs to be part of serverless application monitoring in order to be able to answer those questions in a matter of minutes. The main operational questions when things go bad:
- What is the user functionality being affected?
- Which users were affected and how?
- What is the root cause of these issues?
Getting a good night’s sleep is within arm’s reach...
Slides
YouTube Video
Visualization in Serverless Applications
Ran Ribenzaft
Co-Founder & CTO @ Epsagon
I’m a passionate developer, with vast experience in network, infrastructure, and cyber-security. Constantly chasing new technologies - the current one being Serverless. Love sharing open source tools to make everyone’s lives easier :) In my current role, I’m the co-founder and CTO at Epsagon - monitoring for serverless applications. I love swimming, traveling around the world, and taking breathtaking pictures.
Summary
Modern distributed applications are often seen as a graph of nodes and edges, where each node represents a micro-service, a function, or an API service. Serverless applications take this to the extreme, as each component is smaller than ever. Visualization can help in several key aspects of designing and operating such applications:
- Designing the application and onboarding new team members
- Debugging while developing
- Troubleshooting complex issues in production
- Identifying bottlenecks of performance and costs
Slides (PDF)
YouTube Video
Monitoria: A monitoring democracy
Yaron Idan
DevOps @ Soluto
Been chopping away at various types of software for 15 years now. After spending ~10 years as a DBA I shifted to a DevOps-oriented approach and started spending my energy on mastering the toolchain. Currently an essential part of Soluto’s DevOps team and leading the effort of shifting our production workload to Kubernetes, along with many other daily challenges.
Summary
Monitoring is important - but can also be complicated. At Soluto, we put a lot of effort into this, so we can know that our users get the best experience. We want to share how we transformed monitoring from a one-man job to something every employee cares about and actively participates in.
Slides (PDF)
YouTube Video
Logging in the cloud: machines first human come second
Itiel Shwartz
Lead production engineer @ Rookout
Originally a Backend developer, now turned DevOps. Started working at eBay but felt it’s too big. Joined a 50 ppl startup (Forter) but it got big again. So now I’m a first developer (Rookout) - hope to make my company big :)
Summary
Our system complexity is growing like crazy, what about our logging? While writing new code we should think about the (poor) developer trying to debug this in production. Sadly we feel like our code is great - so no need to log it :) Come and find out what’s broken in logging design, and how to fix it.
Slides (PDF)
YouTube Video
Lunch
Monitoring lessons from Waze SRE team
Yonit Gruber-Hazani
System Operations Engineer, Waze SRE Team
Yonit is a System Operations Engineer at Google and works on the Waze infrastructure team in Tel Aviv, which manages and monitors thousands of servers in multiple cloud providers. Yonit has more than 20 years of experience in the DevOps world, integrating open source apps into startups and supporting various web stacks.
Summary
Our monitoring concepts are changing in a world of thousands of servers, from the single pets to the holistic view of a system as a collection of critical user journeys. How do you enable developers to self-monitor their hundreds of microservices, with hundreds of deployments and configuration changes per day? Do we really need to monitor every server's CPU? How many DevOps engineers do you need to wake in the middle of the night for every disk that dies? What are the lessons we learned while trying to monitor our ever-changing system?
Slides
YouTube Video
Exploiting monitoring for fun and profit
Omer Levi Hevroni
DevSecOps Engineer @ Soluto
I’ve been coding since 4th grade, when my dad taught me BASIC, and haven’t looked back since. I’m an AppSec/DevSecOps enthusiast, and always curious about integrating more hacking tools into the CI/CD pipeline. I’m always looking for new interesting ways to increase security awareness over the entire R&D – developers, product, and UX. I strongly believe in OWASP and am a proud member. Besides that, I’m also the OWASP Glue project leader. I am an open source addict, using OSS heavily and contributing back. Today I’m working at Soluto. And most importantly – I’m a proud father to two beloved children, and happily married.
Summary
We all use monitoring - after all, we all want to ensure that our applications are working as expected. But can hackers use it to exploit our application? Join me to explore different ways to exploit application monitoring tools, and what mitigations we can use - including live demos!
Slides
YouTube Video
Monitoring done wrong
Avishai Ish-Shalom
Engineer in Residence @ Aleph VC
Software janitor, data plumber, infrastructure handyman. Will code for scotch.
Summary
We've all heard numerous "awesome monitoring @ X" talks; Boring! Join me in exploring monitoring design principles through various fails - because we can learn sooo much more by analyzing cases where monitoring was done wrong :-)
Slides
YouTube Video
Spanning services - The practical guide to Distributed Tracing
Yair Galler
Backend Engineering Team Lead @ Next Insurance
Yair Galler is an engineering team lead at Next Insurance’s backend group. He has over 15 years of full-stack software engineering experience tackling everything from devops and backend challenges to the latest frontend frameworks.
Summary
What is distributed tracing and why should I care? This talk will introduce tracing concepts, help you get started with it and share practical dos and don’ts.
Slides (PDF)
YouTube Video
Happy hour and games
Zen of production incident management
Dan Yelovitch
Senior SRE @ ZipRecruiter
Summary
Zen of production incident management, or, to put it simply, how to handle production incidents at any scale.
Slides (PDF)
YouTube Video
Monitoring Driven Debugging
Anna Tsibulskaya
Systems engineer @ Wix
Distributed HPC monitoring
Salo Shp
Solutions architect @ Tikal
Actionable alerts
Savva Khalaman
Backend developer @ Wix
Defining KPIs Like a Pro
Elena Levi
Product Manager @ AppsFlyer
Sensory Friendly Monitoring: Keeping the Noise Down
Quintessence Anx
Developer Advocate @ Logz.io
Quintessence has worked in the IT community for over 10 years, including as a database administrator and a DevOps / Cloud / Infrastructure engineer. She was a core contributor to Stark & Wayne's SHIELD project, which adds backup functionality to Cloud Foundry, as well as a technical reviewer for Learning Go Programming published by Packt Publishing. Currently she is the US Developer Advocate for Logz.io, focusing on community engagement related to DevOps, with an emphasis on monitoring and observability. Outside of work she is a leader and co-founder of Inclusive Tech Buffalo, which helps underrepresented minorities in tech launch careers in development in the Buffalo region.
Summary
The ability to monitor infrastructure has been exploding, with new tools on the market and new integrations so the tools can speak to one another. This leads to even more tools, and to a hypothetically very loud monitoring environment, with various members of the engineering team finding themselves muting channels, individual alerts, or even alert sources so they can focus long enough to complete other tasks. There has to be a better way - a way to configure comprehensive alerts that send out notifications with the appropriate level of urgency to the appropriate persons at the appropriate time. And in fact there is: during this talk I’ll be walking through different alert patterns and discussing what we need to know, who needs to know it, and how soon and how often they need to know.
Slides
YouTube Video
Organizing Committee
This conference is a community effort by and for people who do monitoring daily and care about monitoring. The organizing committee are all volunteers and sponsorships cover the direct costs of the conference.
Shlomi Noach
Staff Software Engineer @ GitHub
Shlomi is a developer and database geek. He is an active MySQL community member, authors orchestrator, gh-ost, freno and other open source tools, and blogs at http://openark.org. He works at GitHub on the database infrastructure team solving high availability, reliability, enablement problems, running automation and testing. Shlomi is the recipient of MySQL Community Member of the Year, Oracle ACE (Alumni) & Oracle Technologist of the Year awards.