Software Engineer - Reliability/SRE - Intermediate, Cape Town
Cape Town, South Africa
Last edited: less than a month ago
Description
Electrum is a next-generation payment software technology company.
Since 2012, we've delivered trusted, enterprise-grade, cloud-native software to optimise financial transaction processing. Our deep expertise has established us as a respected partner in high-volume, low-value payment schemes, enabling clients to deliver services to millions of South Africans daily.
At Electrum, we are grounded in impact: designing solutions that matter, acting with urgency, and continuously learning as we scale. We believe in creating together: working side by side with our clients and teams to build meaningful, lasting solutions. We prioritise making it safe: encouraging open communication, smart risk-taking, and trust so that creativity and alignment thrive. And we back empowered strong teams: hiring brilliant people, collaborating hard, and holding each other to high standards while leading with empathy and kindness.
The Role
As a Core Reliability Engineer, you will not be joining a traditional 24/7 operations team. Instead, you will act as a central software team enabler, defining the standards, observability tooling, and automation frameworks that allow our stream-aligned product teams to own their service health independently.
Reliability in our specific FinTech niche isn't just about keeping servers up; it's about processing high-volume, high-impact financial transactions where a dropped message has real-world consequences. We are looking for an innovative systems thinker who wants to solve difficult industry problems, architect for scale alongside reliability, and help us set the industry benchmark for reliability in payments.
Your ultimate goal is to ensure reliable software is easy to build, and when we fall short, we know about it before our clients do.
Responsibilities
Enablement & RelOps Culture
Implement the Observability Ladder: Guide teams from basic monitoring to high-signal metric tracking. Work with product teams to define SLAs, SLIs, and SLOs, and build the dashboards that track specific error budgets.
Empower Product Teams: Build frameworks and deployment tooling (e.g., CI/CD, internal tooling integrations) that allow teams to make data-driven decisions on deployment safety and automate rollbacks when error budgets are depleted.
Champion Reliability: Drive a blameless post-mortem culture focused on actionable takeaways, system improvements, and measurable metrics (MTBF, MTTR).
Frameworks & Automation
Standardised Alerting & On-Call: Continuously improve our company-wide alerting and on-call frameworks to reduce alert fatigue and ensure that, when a pager goes off, the alert is highly actionable and symptom-based.
Disaster Recovery: Evolve our DR strategies from manual processes into fully automated "runbooks-as-code". You'll build the tooling that allows teams to prove and improve their service's recoverability through autonomous, evidence-based testing.
Eliminate Toil: Develop systems, automations, and tooling in Python (or similar), e.g. for pre- and post-deployment verification, so that our "hands-off" reliability vision becomes a production reality.
Reliability-as-Code: Lead the drive to manage our entire reliability suite through Infrastructure as Code. Use Terraform to architect, deploy, and configure our observability stack (including ELK, Grafana, Loki, Prometheus, and tracing), ensuring our monitoring environment is as reliable as our production systems.
Education:
Bachelor's degree in Computer Science, Information Technology, or a related field.
Experience:
2+ years of experience in Software Engineering, SRE, DevOps, or Platform Engineering.
Strong coding fluency: Proficiency in Python (or similar) with the ability to read, understand, reason about, and write automation code (e.g., scripting automated deployment rollbacks).
Cloud & IaC: Hands-on experience with AWS, and a solid understanding of Infrastructure as Code (Terraform or CloudFormation) to prevent configuration drift.
Deep Observability Knowledge: Demonstrable experience with monitoring tools (DataDog, Prometheus, ELK stack). You should understand SRE concepts like "Golden Signals", high-cardinality data handling, and the math behind error rates.
Systems Thinking: A strong grasp of designing for scale and resilience, including concepts like graceful failure, circuit breaking, connection pooling, and multi-AZ deployments.
Why Join Electrum?
We believe in a People First approach, ensuring a culture where you can thrive and make a real difference.
Your Career & Culture
Career Growth: Delivering world-class financial software is challenging, but your effort will earn you hands-on experience with products used by millions, accelerating your career.
Strong Teams: We keep teams small, focused, and collaborative to maximise impact.
Transparency: We openly discuss strategy, finances, and salaries. Mistakes are viewed as learning opportunities that we actively discuss.
Autonomy: We trust you. You're expected to seek out the data needed for informed decisions and manage your own time, knowing when to focus and when to recharge.
Shared Vision: You'll have the power to shape the vision of how we build the future of financial services.
Practical Perks
Here’s how we support our culture:
Flexible Work: Office-first environment with flexible hours.
Generous Leave: Starting at 20 days per year.
Office Perks (Cape Town): Fully-stocked kitchen and daily catered lunch.
Social Life: Regular team activities like hikes, getaways, and dinners.