pandas – Python – How to utilise more memory


It’s not clear why you expect that providing more memory will make DataFrame operations run faster. If you were at 80% utilisation, memory didn’t appear to be a barrier in the first place.

However, Linux has swap space, and swapping genuinely can slow a computation down: the OS may decide it has to start dumping data to disk and reading it back later. I don’t think that’s the issue here, but you can check with something like htop; the fact that upgrading your RAM hasn’t changed your processing speed also suggests swapping wasn’t the problem.
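
If you’d rather check from Python than from htop, the third-party psutil package (assuming it’s installed) exposes the same counters. A minimal sketch:

```python
import psutil

# Swap activity: a high "used" percentage, or sin/sout counters that
# climb while your job runs, would point at memory pressure.
swap = psutil.swap_memory()
print(f"swap used: {swap.percent}% ({swap.used / 2**30:.1f} GiB)")

# Physical RAM for comparison.
mem = psutil.virtual_memory()
print(f"RAM used:  {mem.percent}% of {mem.total / 2**30:.1f} GiB")
```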

RAM is a potential bottleneck, but adding more of it doesn’t directly improve processing speed. How could it? Once you have enough memory, it’s the CPU that sets the pace.

Your CPU utilisation is low, and that’s what is actually limiting your speed: a lot of pandas operations are single-threaded, and Python’s GIL is a real constraint on running Python code in parallel. There are ways around this in compiled extensions (which pandas, and the libraries it’s built on, sometimes use), but it’s hard to know a priori whether a particular operation you’re using actually takes advantage of them.
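
To give one concrete, hedged example of getting around the GIL yourself (rather than anything pandas does internally): split the DataFrame into chunks and farm them out to worker processes with the standard library’s concurrent.futures. The data and the process_chunk function here are invented for the sketch:

```python
import concurrent.futures as cf

import numpy as np
import pandas as pd

def process_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for your real per-chunk computation.
    return chunk.assign(total=chunk["a"] + chunk["b"])

if __name__ == "__main__":
    df = pd.DataFrame(np.random.rand(1_000_000, 2), columns=["a", "b"])

    # Slice the frame into 8 roughly equal chunks.
    step = -(-len(df) // 8)  # ceiling division
    chunks = [df.iloc[i:i + step] for i in range(0, len(df), step)]

    # Separate processes each have their own interpreter and GIL.
    # Note the chunks are pickled over to the workers, so this only
    # pays off when the per-chunk work outweighs that copying cost.
    with cf.ProcessPoolExecutor() as pool:
        result = pd.concat(pool.map(process_chunk, chunks))

    print(result.head())
```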

polars seeks to address this in a number of ways. It multi-threads a lot of its operations (getting around the GIL where pandas doesn’t), and it also has “lazy evaluation”, which is the opposite of using more RAM: it reads only the data your calculation actually needs, rather than loading the whole lot up-front.
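
For a flavour of what the lazy API looks like (a sketch; the file and column names are invented), polars builds a query plan first and only reads what the plan needs when you collect it:

```python
import polars as pl

# scan_csv is lazy: nothing is read yet, we are just describing a query.
lazy = (
    pl.scan_csv("measurements.csv")     # hypothetical file
      .filter(pl.col("value") > 0)      # predicate pushed down to the scan
      .group_by("sensor")
      .agg(pl.col("value").mean())
)

# collect() optimises the plan and executes it across multiple threads,
# reading only the columns and rows the query actually touches.
result = lazy.collect()
print(result)
```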

None of this will help if you’ve designed your code around some pathological loop that stops these libraries from applying their optimisations, but hopefully it gives you some pointers on how to proceed.
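
For what I mean by a pathological loop, a contrived illustration: iterating over rows in Python keeps every addition in the interpreter, while the vectorised equivalent does the same work in compiled code:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.rand(100_000),
                   "b": np.random.rand(100_000)})

# Pathological: a Python-level loop over rows. Every iteration pays
# interpreter overhead, and nothing can be vectorised or threaded.
totals = []
for _, row in df.iterrows():
    totals.append(row["a"] + row["b"])
df["total_slow"] = totals

# Idiomatic: one vectorised expression, executed in compiled code.
df["total_fast"] = df["a"] + df["b"]
```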


