Erik Espinoza

Proven Technologist, Leader & Team Builder

Professional Experience

Jul 2025 - Present

Apple - Manager, SRE, Messaging

Cupertino, CA

Apr 2024 - May 2025

dbt Labs - Manager, SRE, RelEng and DevEx

San Jose, CA (Remote)

At dbt, I was the Manager of Site Reliability Engineering, Release Engineering, and Developer Experience. We operated a fully remote team with members distributed throughout the US. The teams were responsible for dbt Cloud Incident Management, Observability, Release Infrastructure, and Developer tooling.

Managed and led nine full-time engineers.
Drove organizational consistency across SLOs, ensuring troubleshooting and debugging issues required little additional context.
Improved organizational handling of Incidents by powering status with Service Catalog, driving automation of metrics collections, and performing Engineering training.
Prioritized and created incident-mitigating features that reduced customer-impacting minutes.
Managed company-wide metrics surrounding fleet availability and incident follow-ups.

Jun 2022 - Jan 2024

BlueJeans by Verizon - Senior Manager, SRE & DevOps Tools

San Jose, CA

At BlueJeans, I was the US Manager of Site Reliability Engineering and Developer Tools. We followed the sun and operated a second team in Bangalore, India. BlueJeans SRE operated Kubernetes infrastructure in a hybrid (DC + Cloud), multi-cloud (AWS + Azure) environment. DevOps Tools operated Developer and IaC tooling.

Cultivated the team from four full-time Engineers to 12 full-time Engineers.
Composed and defined the BlueJeans sunset plan.
Designated an objective metric for application risk and targeted fixes for the riskiest services.
Delineated a Developer Tools lifecycle to ensure gaps were addressed and updates or deprecations occurred in a predictable fashion.
Authored and implemented an onboarding process for new services and functionality to ensure frequent and fruitful interaction between Dev and SRE. This delivered SLOs, background, and documentation on generic mitigations.
Led self-healing automation deployment capable of resolving 90% of current alerts without human intervention.

Sep 2020 - May 2022

NS1 - Engineering Manager, TechOps & Observability

San Jose, CA

At NS1, I managed the TechOps and Observability Teams. The TechOps team was a horizontal team that operated our Managed DNS products, focused on technical breadth. The Observability team focused on extracting actionable data from internal metrics to allow data-based decisions at every level.

Manager of 12 Engineers across the US and Vietnam.
Created an Observability strategy centered on implementing high-quality indicators (SLI) and objectives (SLO), which translated into major signal improvements.
Led reduction of pages from four thousand per month to under five hundred while maintaining our Incident detection rate, substantially reducing alert noise.
Curated SD-WAN deployment to stabilize the ChinaNet offerings, eliminating multiple hours of daily toil from the previous workflow.
Designed and implemented a BeyondCorp / Zero Trust management plane, reducing complexity by retiring the inflexible team-based VPN solution.
Constructed a new Incident Management process, including postmortems and reviews. This focused on mitigation during customer pain, with root cause analysis and improvements occurring offline.

Oct 2018 - Aug 2020

Google LLC - Site Reliability Engineering Manager

Sunnyvale, CA

At Google, I was responsible for the reliability of Google Cloud Storage and its internal counterpart, Blobstore. The SRE team was sharded into Serving and Backend. As Backend SRE Manager, I was primarily focused on storage health, including internal dependencies, durability, and data integrity. In addition, I joined at a time of major team reorganization and had to hire and onboard others while personally onboarding.

Participated in the team as a design reviewer, code reviewer and oncaller.
Hired and onboarded six Software Engineers.
Built roadmaps with the other SRE shard, Dev partners, and Dependency orgs.
Created training forum for GCS (30 attendees, twice weekly).
Led SLO improvements across GCS, from processes to implementations.
Taught Production Storage at SRE EDU, a mandatory week of training for all new hire SREs across the entire company.

Aug 2007 - Sep 2018

eBay Inc

San Jose, CA

Apr 2015 - Sep 2018 - Senior Manager, Infra Arch & Search SRE

At eBay I wore many hats. I was the Infrastructure Architecture Manager, Search SRE Manager, and a member of the Virtual Architecture Team focused on Infrastructure. The Search Infrastructure alone accounted for the majority of eBay’s Data Center space.

Participated in the Infra Arch team as an IC, creating my own blueprints and partnering with internal teams to resolve problems without clear next steps.
Manager of 12 Engineers and Architects. Served as a Tech Lead within GTO (70 Engineers between San Jose and Shanghai).
Manager of the Search SRE Manager in Shanghai, China.
Led the Search team from a 54% automation change rate to 96%. This reduced OpEx by $500K annually by removing the need for contractors. This also helped drive the service to 99.999% availability for 2016, 2017 and 2018.
Reduced footprint by 200 racks of gear through more efficient hardware, saving OpEx via Data Center savings.
Found and reported a DoS vulnerability in the F5 GTM Appliance (K23022557).
Award: eBay Cultural Luminary - 2018

Sep 2014 - Apr 2015 - Head of TechOps, Advertising

Manager of 12 professionals, Systems & Network Engineers + DBAs.
Worked with Product and Dev teams to create roadmaps for two business lines. This included effort, scoping and deliverables.
Developed TechOps Roadmap with my reports. Analyzed all infrastructure and identified our biggest threats to the business.
Led Holiday Readiness capacity adds. This included deep dives in 30 customer facing subsystems, 20 requiring capacity or architecture changes.
Reduced OpEx $300K annually by reducing duplication of external network services.
Created budget for 2015. YoY savings of $500K annually.

Jan 2014 - Sep 2014 - DevOps Manager, Advertising

Manager of six Professionals, DevOps & NetEng.
Leadership role in a 150-person BU which generates over $400 million a year.
Led triage and incident management process. This reduced unplanned Ops work to less than 5% of our sprint and ensured we didn’t fail the same way more than once.
Reduced duplication of efforts with my Ops counterparts in various cross BU projects.
Designed and implemented reliable disaster recovery architecture.
Developed Infrastructure for cloud-based data pipelines.

Aug 2007 - Dec 2013 - Lead Systems Engineer, Advertising

Led a team of 15 Operations professionals supporting a 24x7 production environment with 2,000 servers. This included 1,700 Linux, 250 Windows and 50 Solaris hosts.
Provided architecture and design support on new apps and rewrites for every layer of our infrastructure. This includes Frontend, Images, Tracking and Import/Export of partner data.
Served as Commerce Lead on coast-to-coast data center migration. Migrated merchant ingestion system capable of 200 million SKU/hour and partner export system that generated custom feeds for over 1,000 partners with a 12-hour window.
Built network installer that bootstrapped 1,000 machines in a few hours for the data center migration. This utilized the native OS installer, CDPR, MySQL and PHP.
Led project to migrate our legacy connectivity to the Corporate eBay backbone. This reduced cost by $130K / year and had no service degradation.
Automated infrastructure management, including a GitHub webhook-based DNS auto-update and syntax checking API using BIND, Apache and Perl.

Aug 2005 - Aug 2007

LIGO @ Caltech - Lead Systems Engineer

Pasadena, CA

Built and administered HPC cluster of 350 nodes with Condor scheduler.
Worked with Scientists to profile and optimize apps resulting in a 33% reduction in power and a 40% utilization reduction while keeping the same level of processing per day.
Built and managed Einstein@Home mirror. This allowed us to augment our processing power by allowing anyone with our screensaver application to help us search for gravitational waves.
Reworked Anaconda Installer to load a legacy Linux platform on unsupported hardware for ABI/API compatibility with existing cluster.

Jun 2003 - Aug 2005

JPL / NASA - Systems Engineer

Pasadena, CA

Created first space-to-web publishing system within NASA for the Mars Rovers website (Spirit and Opportunity).
Created and built multi-tenancy HA cluster for high traffic web sites - Mars Rovers, Deep Impact, Cassini, etc.
Built and administered HPC cluster for the TES Instrument on the Aura Spacecraft.
Wrote a BitTorrent wrapper to push large amounts of data to compute nodes. Data pushes that took seven days with the legacy scripts took four hours using BitTorrent.