As organizations start building and running more machine learning models in production, they’re often surprised by how differently these workloads behave compared to traditional applications. It’s not just about compute power: data storage plays a far more critical role than many teams initially expect.
That’s where Ceph comes in.
The data challenge in machine learning
Machine learning (ML) pipelines typically involve massive volumes of unstructured data, such as images, video, logs, and sensor readings, often stored in object format. But it’s not just about volume. These workloads also require high-throughput reads during training, efficient parallel access, and fast write performance during preprocessing and result logging.
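As a minimal sketch of what that access pattern can look like in practice: assuming a Ceph RGW endpoint exposed over its S3-compatible API, training shards can be pulled in parallel. The endpoint URL, bucket name, prefix, and credentials below are illustrative placeholders, not values from this article.

```python
# Minimal sketch: parallel reads of training shards from a Ceph RGW bucket
# over its S3-compatible API. Endpoint, bucket, credentials, and key prefix
# are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.internal:8080",  # assumed RGW endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

def fetch_shard(key: str) -> bytes:
    """Download one dataset shard; RGW serves it like any S3 object."""
    return s3.get_object(Bucket="training-data", Key=key)["Body"].read()

# List shards under a prefix and pull them concurrently, which is where
# object-storage read throughput matters most for training jobs.
keys = [
    obj["Key"]
    for obj in s3.list_objects_v2(Bucket="training-data", Prefix="imagenet/")
    .get("Contents", [])
]

with ThreadPoolExecutor(max_workers=16) as pool:
    shards = list(pool.map(fetch_shard, keys))
```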
If your storage cannot keep up, it becomes a bottleneck. This leads to slower model iterations, delayed deployments, and unnecessary infrastructure costs.
Ceph is already part of many AI stacks, even if it’s not always obvious
A growing number of ML platforms, including Kubeflow, OpenShift AI, and custom Kubernetes-based stacks, rely on Ceph under the hood, especially when running on OpenStack or on-prem Kubernetes clusters.
Ceph’s flexibility across block, file, and object storage makes it well suited for different stages of the ML lifecycle:
- Object storage (via RGW) is ideal for storing raw datasets and model artifacts.
- Block storage (via RBD) supports high-performance training jobs, particularly on GPU nodes.
- CephFS works well for shared access across distributed training environments or preprocessing pipelines.
This versatility means you can build a consistent and scalable storage layer for your entire machine learning pipeline without adding new silos.
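To make that versatility concrete, here is a minimal sketch of how a Kubernetes-based ML platform might request both block- and file-backed volumes from the same Ceph cluster using the official Kubernetes Python client. The StorageClass names (rook-ceph-block, rook-cephfs), namespace, and sizes are assumptions based on common Rook defaults, not values from this article.

```python
# Minimal sketch: claiming RBD-backed and CephFS-backed volumes from one
# Ceph cluster through Kubernetes. StorageClass names, namespace, and sizes
# are assumptions (typical Rook defaults); adjust to your environment.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

def make_pvc(name: str, storage_class: str, mode: str, size: str):
    return client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=[mode],
            storage_class_name=storage_class,
            resources=client.V1ResourceRequirements(requests={"storage": size}),
        ),
    )

# RBD (block) for a single GPU training node: fast, exclusive access.
core.create_namespaced_persistent_volume_claim(
    namespace="ml-train",
    body=make_pvc("train-scratch", "rook-ceph-block", "ReadWriteOnce", "500Gi"),
)

# CephFS (file) shared across preprocessing workers and distributed trainers.
core.create_namespaced_persistent_volume_claim(
    namespace="ml-train",
    body=make_pvc("shared-datasets", "rook-cephfs", "ReadWriteMany", "2Ti"),
)
```

The point is that both claims are served by the same Ceph cluster, so block and file access remain part of one storage layer rather than becoming separate silos.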
But Ceph still needs tuning for ML workloads
While Ceph can support ML workloads well, it is not plug-and-play. Some key areas often need attention:
- I/O tuning: Training jobs may demand high read throughput from object storage, especially with large datasets. Adjusting RADOS and RGW parameters can improve performance (see the sketch after this list).
- Network and placement group design: Distributed training can generate intense east-west traffic. Poor placement or underpowered OSD nodes can cause latency spikes.
- Hardware choices: SSDs are strongly recommended for metadata-intensive operations, especially with CephFS. Fast NVMe storage can make a real difference for preprocessing-heavy jobs.
- Multi-tenancy controls: If you’re serving multiple teams or pipelines, quota and QoS settings become critical to avoid noisy-neighbor problems.
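As a rough illustration of the kind of adjustments meant above, the sketch below wraps the standard ceph and radosgw-admin CLIs from Python. The option values and the tenant UID are illustrative only, not recommendations; validate option names and size syntax against your Ceph release before use.

```python
# Rough sketch of the tuning and quota steps above, wrapping the standard
# ceph / radosgw-admin CLIs. Option values and the tenant UID are
# illustrative only; validate option names against your Ceph release.
import subprocess

def run(*cmd: str) -> str:
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# I/O tuning: give RGW more worker threads for read-heavy training traffic,
# and keep OSD backfill from competing with foreground I/O.
run("ceph", "config", "set", "client.rgw", "rgw_thread_pool_size", "512")
run("ceph", "config", "set", "osd", "osd_max_backfills", "1")

# Multi-tenancy: cap an RGW user's total capacity so one team's dataset
# growth cannot starve everyone else (noisy-neighbor protection).
run("radosgw-admin", "quota", "set", "--quota-scope=user",
    "--uid=team-vision", "--max-size=10T")
run("radosgw-admin", "quota", "enable", "--quota-scope=user", "--uid=team-vision")

# Sanity check: confirm the RGW setting actually landed.
print(run("ceph", "config", "get", "client.rgw", "rgw_thread_pool_size"))
```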
Looking ahead: scaling without surprises
As ML use grows across your organization, storage needs tend to scale faster than expected. Teams add more data, rerun experiments, and version models more frequently. Ceph’s horizontal scalability is a big advantage here, but only if the cluster is monitored and managed proactively.
This is where operational experience matters. Whether it’s scaling RGW for object-intensive workloads or making decisions about cache tiering or CRUSH map design, storage choices will directly influence how fast teams can ship new models.
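A small example of what proactive monitoring can look like: parsing ceph df JSON output to flag pools approaching capacity before a training run fills them. The field names are assumed from typical `ceph df --format json` output and may differ slightly between releases.

```python
# Small sketch: flag Ceph pools that are approaching capacity so growth from
# new datasets or experiment reruns is caught early. Field names are assumed
# from typical `ceph df --format json` output and may vary by release.
import json
import subprocess

WARN_RATIO = 0.75  # illustrative threshold

raw = subprocess.run(
    ["ceph", "df", "--format", "json"],
    check=True, capture_output=True, text=True,
).stdout

for pool in json.loads(raw).get("pools", []):
    stats = pool.get("stats", {})
    used = stats.get("percent_used", 0.0)  # fraction of available space used
    if used >= WARN_RATIO:
        print(f"pool {pool['name']} at {used:.0%} used, "
              f"~{stats.get('max_avail', 0) / 2**40:.1f} TiB still available")
```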
Final thoughts
Machine learning is not just compute-intensive. It is storage-hungry, operationally complex, and growing fast. Ceph offers a strong foundation to build on, but that foundation needs to be designed with workload patterns in mind.
If your organization is expanding its AI capabilities, now is a good time to re-evaluate your storage architecture. It is easier to prepare a Ceph cluster for ML workloads early on than to fix bottlenecks later under pressure.
42on helps organizations do just that, with deep experience running, optimizing, and supporting Ceph in production, including for hybrid and AI-driven environments.
Let us know if you’d like to explore what your storage environment needs to keep pace.