How to release GPU memory after each inference? · Issue #446 · microsoft/onnxruntime-genai


Situation

  • I want to run multiple models on the same GPU, but onnxruntime-genai (ort-genai) does not release GPU memory after each inference. I want to find a way to free GPU memory after each inference turn.

I have tried

  1. Adding

import torch

if torch.cuda.is_available():
    torch.cuda.empty_cache()   # release cached blocks held by PyTorch's CUDA allocator
    torch.cuda.ipc_collect()   # clean up CUDA IPC handles

after each inference, but it does not work. I guess the memory is claimed by ort-genai, not by torch, so clearing PyTorch's cache has no effect.

  2. Research: plain onnxruntime (without GenAI) has configuration options to shrink the GPU memory arena after each run and release the memory, but I have not found a way to do the same in ort-genai (see the sketch below this list).
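
For reference, below is a minimal sketch of the plain-onnxruntime mechanism I mean: a per-run config entry that asks the CUDA memory arena to shrink after InferenceSession.run(). The model path and input name are placeholders, and I do not know whether ort-genai exposes anything equivalent.

import numpy as np
import onnxruntime as ort

# "kSameAsRequested" stops the CUDA arena from over-allocating in large
# chunks, which is what makes per-run shrinkage actually return memory.
providers = [("CUDAExecutionProvider", {"arena_extend_strategy": "kSameAsRequested"})]
session = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path

run_options = ort.RunOptions()
# Ask the arena on GPU 0 to release its unused chunks at the end of this run.
run_options.add_run_config_entry("memory.enable_memory_arena_shrinkage", "gpu:0")

inputs = {"input": np.zeros((1, 3, 224, 224), dtype=np.float32)}  # placeholder input
outputs = session.run(None, inputs, run_options=run_options)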

Some possible ways I can think of:

  • Run ort-genai in a separate thread or process (not sure how; see the sketch after this list)
  • Use the onnxruntime config (I guess there should be some way to configure the underlying onnxruntime session?)
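
To make the first idea concrete, here is a sketch using a separate process rather than a thread, since GPU memory owned by a process is only guaranteed to be returned to the driver when that process exits. The model path, prompt, and the placeholder result are assumptions; the actual ort-genai generation loop would go inside the worker. The obvious trade-off is that the model has to be reloaded on every call.

import multiprocessing as mp

def _worker(model_path, prompt, queue):
    # Import inside the child so all CUDA state lives in this process only.
    import onnxruntime_genai as og
    model = og.Model(model_path)
    tokenizer = og.Tokenizer(model)
    # ... build GeneratorParams / Generator and run the generation loop here ...
    queue.put("generated text")  # placeholder result

def generate_in_subprocess(model_path, prompt):
    ctx = mp.get_context("spawn")  # "spawn" avoids CUDA-after-fork problems
    queue = ctx.Queue()
    proc = ctx.Process(target=_worker, args=(model_path, prompt, queue))
    proc.start()
    result = queue.get()
    proc.join()  # when the worker exits, all of its GPU memory is released
    return result

if __name__ == "__main__":
    print(generate_in_subprocess("path/to/model", "Hello"))  # placeholder arguments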

In the best case, I want to release the GPU memory while keeping the model available for later use; otherwise, I wonder if I can delete the onnxruntime_genai.Model instance to get the memory back.
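
For that fallback, this is the kind of teardown I have in mind: drop every Python reference to the ort-genai objects and force garbage collection so their native (GPU-backed) resources can be destroyed. As far as I can tell there is no explicit release/close call, so this relies on refcounting; the model path is a placeholder.

import gc
import onnxruntime_genai as og

model = og.Model("path/to/model")  # placeholder path
tokenizer = og.Tokenizer(model)

# ... run inference ...

# Delete every reference that keeps the native model alive, then collect.
del tokenizer
del model
gc.collect()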

I'd appreciate any suggestions on how to achieve this.
Thank you for your time!


