Drastically Reducing Out-of-Memory Errors in Apache Spark at Pinterest
Felix Loesing | Software EngineerIn 2025, we set out to drastically reduce out-of-memory errors (OOMs) and cut resource usage in our Spark applications by automatically identifying tasks with higher memory demands and retrying them on larger executors with a feature we call Auto Memory Retries.Spark...