Don't grow your pandas DataFrames row by row

January 16, 2025

Guido Carugati

When working with pandas, a frequent task is filling a DataFrame row by row. Recently, I had to perform this operation on a HUGE dataset, and to my surprise (or not), it turned out to be extremely time-consuming. This led me to dig a little deeper, and I arrived at a key conclusion: avoid growing a pandas DataFrame row by row.

I’m pretty sure many of us in this scenario instinctively start by creating an empty DataFrame and then filling it iteratively. However, this approach is extremely inefficient and should be avoided. A far more performant way is to accumulate the data in a list (or any similar data structure) and only convert it into a DataFrame once all processing is complete.

Collect data first

import pandas as pd

data = []
# Collect the rows in a plain Python list
for row in function_that_generates_data():
    data.append(row)
# Convert the list to a DataFrame in one go
df = pd.DataFrame(data)

Advantages of this method:

  • Efficiency: Appending to a Python list is much faster than growing a pandas DataFrame. Lists over-allocate their underlying storage, so appends run in amortized constant time instead of triggering a reallocation and copy on every addition.
  • Automatic dtype inference: When the list is converted to a DataFrame, pandas infers the column data types automatically, so there is no need to set them by hand.
  • Simpler indexing: pandas assigns a default RangeIndex when the DataFrame is created, eliminating the need to manually track row indices during data collection. The last two points are illustrated in the short example below.
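
A quick, self-contained illustration of those two points (the column names and values here are made up):

import pandas as pd

# Hypothetical rows collected in a plain list of dicts
data = [
    {"name": "a", "value": 1, "score": 0.5},
    {"name": "b", "value": 2, "score": 1.5},
]
df = pd.DataFrame(data)

print(df.dtypes)  # value and score are inferred as int64 and float64, name as object
print(df.index)   # RangeIndex(start=0, stop=2, step=1), assigned automatically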

Common practices to avoid

Using .concat within a loop

This is probably the first approach many of us would reach for: calling .concat() in every iteration of a loop.

df = pd.DataFrame()
for row in function_that_generates_data():
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)

Why is it bad? Each call to .concat() triggers a full memory reallocation and copy. Unlike Python lists, which append elements in amortized constant time, pandas DataFrames have no such optimization.

When concatenating DataFrames, the data of the existing DataFrame is copied into a brand-new one, and the new row is appended to that copy. Inside a loop, this means copying 1 row, then 2, then 3, and so on, roughly n²/2 row copies in total, which gives the operation quadratic time complexity and makes it dramatically slower as the dataset grows.
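
If you want to see the quadratic behavior for yourself, a tiny micro-benchmark along these lines is enough (the row contents are arbitrary; only the growth pattern matters):

import timeit
import pandas as pd

def grow_with_concat(n):
    df = pd.DataFrame()
    for i in range(n):
        df = pd.concat([df, pd.DataFrame([{"value": i}])], ignore_index=True)
    return df

# Doubling the number of rows roughly quadruples the runtime: the signature of O(n^2)
for n in (1_000, 2_000, 4_000):
    print(n, "rows:", round(timeit.timeit(lambda: grow_with_concat(n), number=1), 2), "s")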

Using .loc for row assignment

Another approach is to create an empty DataFrame and then populate it iteratively using .loc:

df = pd.DataFrame(columns=["column1", "column2", "column3"])
for row in function_that_generates_data():
    # .max() on an empty index returns NaN, so the first row needs a special case
    df.loc[0 if df.empty else df.index.max() + 1] = row

Why is it bad? Like .concat(), using this method within a loop also causes frequent memory reallocations. A DataFrame built with .loc row assignment has no predefined size, so pandas has to resize (and copy) it for every newly added row.

Also, note that we are using df.index.max() to determine the next index for .loc. This is slightly faster than df.loc[len(df)] = row, but is still inefficient.

Starting from a DataFrame filled with NaNs

A workaround for the previous approach could be preallocating a DataFrame filled with NaN values and then populating it using .loc.

# num_rows must be known before the loop starts
df = pd.DataFrame(columns=["column1", "column2", "column3"], index=range(num_rows))
for i, row in enumerate(function_that_generates_data()):
    df.loc[i] = row

Why is it bad? First, this approach is only suitable if the DataFrame’s final size is known upfront, which is rarely the case. If the size needs to change during processing, we would have to manually resize the DataFrame, reintroducing the inefficiencies discussed above.

Additionally, initializing a DataFrame filled with NaN values causes all its columns to default to the object data type:

column1    object
column2    object
column3    object
dtype: object

This is problematic because it forces additional type-conversion steps once the DataFrame is filled, which is exactly the kind of extra work we were trying to avoid.
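
That cleanup step would look something like this (the target dtypes here are just an assumption about what the columns hold):

# Cast the object columns explicitly once the frame is filled (hypothetical dtypes)
df = df.astype({"column1": "int64", "column2": "float64", "column3": "string"})
# or let pandas re-infer reasonable dtypes from the object columns
df = df.infer_objects()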

Nice story. Now show me the data.

To measure these methods, I wrote a simple benchmark, which you can find here. These are the results for each method (time in seconds):

┌───────────────────────┬────────────┬───────────────┬─────────────────┐
│ Method                │ 1,000 rows │ 10,000 rows   │ 100,000 rows    │
├───────────────────────┼────────────┼───────────────┼─────────────────┤
│ List of dicts         │ 0.001      │ 0.004         │ 0.037           │
├───────────────────────┼────────────┼───────────────┼─────────────────┤
│ .concat()             │ 0.268      │ 3.084         │ 65.404          │
├───────────────────────┼────────────┼───────────────┼─────────────────┤
│ .loc (without alloc)  │ 0.500      │ 5.501         │ 158.228         │
├───────────────────────┼────────────┼───────────────┼─────────────────┤
│ .loc (with alloc)     │ 0.040      │ 0.392         │ 3.969           │
└───────────────────────┴────────────┴───────────────┴─────────────────┘
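
For reference, a minimal timing harness along these lines (a simplified sketch, not the exact script behind the numbers above) is enough to reproduce the trend:

import time
import pandas as pd

def rows(n):
    # Stand-in for whatever actually generates the data
    for i in range(n):
        yield {"column1": i, "column2": i * 0.5, "column3": str(i)}

def list_of_dicts(n):
    return pd.DataFrame(list(rows(n)))

def concat_loop(n):
    df = pd.DataFrame()
    for row in rows(n):
        df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
    return df

for n in (1_000, 10_000):
    for build in (list_of_dicts, concat_loop):
        start = time.perf_counter()
        build(n)
        print(f"{build.__name__:>13} | {n:>6} rows | {time.perf_counter() - start:.3f} s")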

I went a little further and tested everything with perfplot. This was the output:

[Figure: Comparison of Methods (perfplot output)]

Note about DataFrame.append(): the pandas developers deprecated the .append method (it has since been removed). The original proposal argued that the method was inefficient and drew a misleading analogy to Python’s list.append(), since the two have very different performance characteristics. The original issue and the discussion around it can be found here, which, by the way, turned out to be pretty comical and heated.