How to release GPU memory after each inference? · Issue #446 · microsoft/onnxruntime-genai


Situation

  • I want to run multiple models on the same GPU, but onnxruntime-genai (ort-genai) does not release GPU memory after each inference. I want to find a way to free GPU memory after each inference turn.

I have tried

  1. Adding

import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()   # release cached blocks held by PyTorch's CUDA allocator
    torch.cuda.ipc_collect()   # clean up CUDA IPC handles

after each inference, but it does not work. I guess the memory is claimed by ort-genai, not by torch, so clearing PyTorch's cache has no effect.

  2. Research: plain onnxruntime (without GenAI) has configuration options to shrink the GPU memory arena after each run and release the memory, but I have not found a way to do the same in ort-genai (see the sketch below this list).
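
For reference, below is a minimal sketch of the plain-onnxruntime mechanism I mean: a per-run config entry that asks the CUDA memory arena to shrink after InferenceSession.run(). The model path and input name are placeholders, and I do not know whether ort-genai exposes anything equivalent.

import numpy as np
import onnxruntime as ort

# "kSameAsRequested" stops the CUDA arena from over-allocating in large
# chunks, which is what makes per-run shrinkage actually return memory.
providers = [("CUDAExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"})]
session = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path

run_options = ort.RunOptions()
# Ask the arena on GPU 0 to release its unused chunks at the end of this run.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

inputs = {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}  # placeholder input
outputs = session.run(None, inputs, run_options=run_options)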

Some possible ways I can think of:

  • Run ort-genai in a separate thread or process (not sure how; see the sketch after this list)
  • Use the onnxruntime config (I guess there should be some way to configure the underlying onnxruntime session?)
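
To make the first idea concrete, here is a sketch using a separate process rather than a thread, since GPU memory owned by a process is only guaranteed to be returned to the driver when that process exits. The model path, prompt, and the placeholder result are assumptions; the actual ort-genai generation loop would go inside the worker. The obvious trade-off is that the model has to be reloaded on every call.

import multiprocessing as mp

def _worker(model_path, prompt, queue):
    # Import inside the child so all CUDA state lives in this process only.
    import onnxruntime_genai as og
    model = og.Model(model_path)
    tokenizer = og.Tokenizer(model)
    # ... build GeneratorParams / Generator and run the generation loop here ...
    queue.put("generated text")  # placeholder result

def generate_in_subprocess(model_path, prompt):
    ctx = mp.get_context("spawn")  # "spawn" avoids CUDA-after-fork problems
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(model_path, prompt, queue))
    proc.start()
    result = queue.get()
    proc.join()  # when the worker exits, all of its GPU memory is released
    return result

if __name__ == "__main__":
    print(generate_in_subprocess("path/to/model", "Hello"))  # placeholder arguments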

In the best case, I want to release the GPU memory while keeping the model available for later use; otherwise, I wonder if I can delete the onnxruntime_genai.Model instance to get the memory back.
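
For that fallback, this is the kind of teardown I have in mind: drop every Python reference to the ort-genai objects and force garbage collection so their native (GPU-backed) resources can be destroyed. As far as I can tell there is no explicit release/close call, so this relies on refcounting; the model path is a placeholder.

import gc
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

# ... run inference ...

# Delete every reference that keeps the native model alive, then collect.
del tokenizer
del model
gc.collect()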

I'd appreciate any suggestions on how to achieve this.
Thank you for your time!


