Premise: I just wanted to pin this topic so that it can be used for future reference.
For large models that do not fit in memory, there is the approach.
MediaPipe is a framework for building multimodal, cross-platform applied machine learning pipelines, which can be used for face detection, multi-hand tracking, object detection, and more.