Stop using for-loops and harness the power of vectorization
Even though AI makes it easier to learn programming and vectorization is well known, a lot of Python code out there can still be improved. This is kind advice for for-loop lovers.
We all know how to use for-loops: they are the basis of any task that needs iteration, in any programming language. We learned them in high school or university, using them for sorting algorithms or vector multiplication in Linear Algebra or Numerical Methods classes, and some of us still use them in real-world projects even though documentation, articles, and blogs out there have shown that for-loops are slow when dealing with big data sets.
When we get hired as junior data analysts or data scientists, documentation and programming tricks matter less than doing the job as well as possible with the tools we know at that moment; but when it is time to improve performance or face bigger challenges, we get a shock. Here I am going to share my latest experience as a consultant, where I had to improve the performance of a Python code chunk that took almost 5 minutes to run with a for-loop and less than 1 minute with vectorization.
Sources
First of all, I want to share some documentation and links that helped me years ago. When I was first dealing with DataFrames and arrays, this article written by @Tirthajyoti Sarkar astonished me with how easy NumPy and vectorization can be; 10 minutes to pandas is a good way to understand the basics of pandas. Last but not least, this YouTube video clearly explains a data transformation workflow with Pandas and NumPy.
Improving performance
As a consultant at a Mexican banking institution, my team and I had to automate a set of Excel formulas by turning them into Python code that runs a credit risk process; before this automation, people at the bank took almost one week to perform this task. We finished the implementation 2 weeks ago and then focused on improving execution time. The whole process runs in 10 minutes; however, we found out that a single code chunk took almost 50% of the execution time, so we decided to dig deeper to make it faster. I am going to show you what the "problem" in our project was, using a DataFrame created with random values. Let's start.
Code example
We need 3 different types of columns:
- float_columns: Random float values between 0 and 1.
- string_columns: A random value from the list in the code below, wherever the corresponding float_columns value is not 0.
- amount_columns: Float values between 0 and 1000.
The Python function below creates that DataFrame (the column-name lists, which the snippet uses as globals, are defined at the top here as an illustrative assumption).
import numpy as np
import pandas as pd

# Assumed module-level column names: 5 of each type, matching the
# best_gar1..best_gar5 output columns used later
float_columns = [f"float{i}" for i in range(1, 6)]
string_columns = [f"string{i}" for i in range(1, 6)]
amount_columns = [f"amount{i}" for i in range(1, 6)]

def create_df(num_rows, zero_percentage, num_columns):
    # Generate random float values between 0 and 1
    values = np.random.rand(num_rows, num_columns)
    # Set roughly zero_percentage of the values to 0
    mask = np.random.choice([0, 1], size=(num_rows, num_columns),
                            p=[zero_percentage, 1 - zero_percentage])
    values = values * mask
    # Shuffle the rows (np.random.shuffle works along the first axis)
    np.random.shuffle(values)
    # Create a DataFrame from the float values
    df = pd.DataFrame(values, columns=float_columns)
    # Create string columns: a random label wherever the float value is not 0
    str_values = ["Acciones IPC", "Acciones y otros valores", "Prenda", "Inmuebles",
                  "Dinero y valores", "Ent con Gar", "Der cobro", "Deuda Emisores",
                  "Fideic", "Hipoteca", "Ap Federales", "Depositos", "Gar Expresa",
                  "Fiduciarios", "AVAL"]
    str_ln = len(str_values)
    new_cols = np.where(df[float_columns] == 0, "",
                        np.random.choice(str_values, size=(num_rows, num_columns),
                                         p=[1 / str_ln] * str_ln))
    # Append the string columns to the DataFrame
    df[string_columns] = new_cols
    # Create bigger float values (between 0 and 1000)
    bigger_values = (values * np.random.rand(num_rows, num_columns)) * 1000
    # Append the amount columns to the DataFrame
    df[amount_columns] = bigger_values
    return df
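For instance, a test DataFrame can be built like this (the zero_percentage value here is just an illustrative choice, not one from the real project):

df = create_df(num_rows=200, zero_percentage=0.5, num_columns=5)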
Here is the code chunk as it was written in our project before the optimization.
def for_loop_execution(df):
    best_gar = [f"best_gar{i}" for i in range(1, 6)]                 # new float columns
    cols = np.array(float_columns)                                   # existing columns
    best_type = [f"best_type{i}" for i in range(1, 6)]               # new text columns
    type_cols = np.array(string_columns)                             # existing columns
    best_value = np.array([f"best_value{i}" for i in range(1, 6)])   # new float columns
    value_cols = np.array(amount_columns)                            # existing columns
    # The new columns must exist before the row-wise .loc assignments below
    df[best_gar] = 0.0
    df[best_type] = ""
    df[list(best_value)] = 0.0
    for i_r in range(df.shape[0]):
        # Sorted positions with 0s in the last places
        xs = get_sorted(row_values=df.loc[i_r, cols], core="for")
        # BEST GAR
        df.loc[i_r, best_gar] = list(df.loc[i_r, cols[xs]])
        # BEST TYPE GAR
        df.loc[i_r, best_type] = list(df.loc[i_r, type_cols[xs]])
        # BEST VALUE
        df.loc[i_r, best_value] = list(df.loc[i_r, value_cols[xs]])
        # COVERED AMOUNT FOR NON FINANCIAL GAR
        aux = np.where(df.loc[i_r, best_type] != "AVAL")[0]
        df.loc[i_r, "covered_amount"] = df.loc[i_r, best_value[aux]].sum()
    return df
Using vectorization
def vectorization_execution(df):
    best_gar = [f"best_gar{i}" for i in range(1, 6)]       # new float columns
    cols = float_columns                                   # existing columns
    best_type = [f"best_type{i}" for i in range(1, 6)]     # new text columns
    type_cols = string_columns                             # existing columns
    best_value = [f"best_value{i}" for i in range(1, 6)]   # new float columns
    value_cols = amount_columns                            # existing columns
    # Sorted positions with 0s in the last places, for all rows at once
    xs = get_sorted(row_values=df[cols].values)
    # BEST GAR
    df[best_gar] = np.take_along_axis(df[cols].values, xs, axis=1)
    # BEST TYPE GAR
    df[best_type] = np.take_along_axis(df[type_cols].values, xs, axis=1)
    # BEST VALUE
    df[best_value] = np.take_along_axis(df[value_cols].values, xs, axis=1)
    # COVERED AMOUNT FOR NON FINANCIAL GAR
    rc = np.where(df[best_type] == "AVAL", 0, df[best_value])
    df["covered_amount"] = rc.sum(axis=1)
    return df
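The heavy lifting here is done by np.take_along_axis, which reorders every row of a 2-D array in one shot using a matrix of per-row indices, exactly what the loop did one row at a time. A minimal sketch of the idea:

import numpy as np

a = np.array([[0.7, 0.2, 0.5],
              [0.1, 0.9, 0.4]])
xs = np.argsort(a, axis=1)           # per-row sort order: [[1, 2, 0], [0, 2, 1]]
np.take_along_axis(a, xs, axis=1)    # [[0.2, 0.5, 0.7], [0.1, 0.4, 0.9]]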
As you may have noticed, both snippets above use the function get_sorted, which was also rewritten even though it performs the same task, all because of the Zen of Python:
Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
…
def get_sorted(row_values, core="vectorization"):
    if core != "vectorization":
        # Row-by-row version: overwrite the 0s with inf in place so that
        # argsort ranks them last
        x_sorted = row_values
        is_zero = np.where(x_sorted == 0)[0]
        x_sorted[is_zero] = np.inf
    else:
        # Vectorized version: build a new array with every 0 replaced by inf
        x_sorted = np.where(row_values == 0.0, np.inf, row_values)
    return np.argsort(x_sorted)
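A quick sanity check of the zeros-to-inf trick, assuming the functions above are in scope:

arr = np.array([[0.3, 0.0, 0.1]])
xs = get_sorted(arr)                  # [[2, 0, 1]]: the 0 is ranked last
np.take_along_axis(arr, xs, axis=1)   # [[0.1, 0.3, 0.0]]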
As we can see, the vectorized code not only looks simpler, it is also faster. Now it's time to try both versions.
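The timings below were measured with a simple time.time() harness along these lines (a sketch; each function gets its own copy of the DataFrame so neither run pollutes the other, and zero_percentage=0.5 is an illustrative choice):

import time

def benchmark(num_rows):
    df = create_df(num_rows=num_rows, zero_percentage=0.5, num_columns=5)
    start = time.time()
    vectorization_execution(df.copy())
    print(f"Vectorization: --- {time.time() - start} seconds ---")
    start = time.time()
    for_loop_execution(df.copy())
    print(f"For Loop: --- {time.time() - start} seconds ---")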
Let’s try a data frame with 200 rows.
Vectorization: --- 0.0078582763671875 seconds ---
For Loop: --- 1.5013961791992188 seconds ---
Now with 2K rows, vectorization is approximately 1.5K times faster than the for-loop.
Vectorization: --- 0.01059412956237793 seconds ---
For Loop: --- 16.272381067276 seconds ---
Lastly, we set up a DataFrame with 40K rows, which is close to the number of records in our real project.
Vectorization: --- 0.05539679527282715 seconds ---
For Loop: --- 334.3119812011719 seconds ---
Conclusion
Vectorization clearly outperformed the for-loop. No matter how much experience you have with Python, it is important to keep the basics, the documentation, and the blogs in mind to know how to deal with these kinds of problems, and why not, to ask ChatGPT how to improve a loop. Thank you for reading this article, and do not forget: "Simple is better than complex".
If you have any questions, don’t hesitate to contact me. You can also explore my GitHub repositories for more code snippets. If you share the same passion for data science/data engineering, feel free to connect with me on LinkedIn or follow me on Twitter.