PERF: read_csv macro updates by WillAyd · Pull Request #52632 · pandas-dev/pandas

@Dr-Irv

This gives us a few percentage points back in read_csv performance, especially with larger files. Credit to @Dr-Irv for the idea.

@WillAyd

Thanks for the shout-out, @WillAyd. Back in the day, I was part of a team writing performance-critical code in C, and we used to do optimizations like this all the time.

If anyone cares about the details, here is the verbose annotated assembly output (from gcc's -fverbose-asm flag) before the change:

.L246:
# pandas/_libs/src/parser/tokenizer.c:929:                 } else if (IS_DELIMITER(c)) {
	.loc 3 929 28 is_stmt 1
	movq	-136(%rbp), %rax	# self, tmp683
	movl	188(%rax), %eax	# self_452(D)->delim_whitespace, _180
# pandas/_libs/src/parser/tokenizer.c:929:                 } else if (IS_DELIMITER(c)) {
	.loc 3 929 27
	testl	%eax, %eax	# _180
	jne	.L247	#,
# pandas/_libs/src/parser/tokenizer.c:929:                 } else if (IS_DELIMITER(c)) {
	.loc 3 929 28 discriminator 1
	movq	-136(%rbp), %rax	# self, tmp684
	movzbl	184(%rax), %eax	# self_452(D)->delimiter, _181
	cmpb	%al, -49(%rbp)	# _181, c
	je	.L248	#,
.L247:
# pandas/_libs/src/parser/tokenizer.c:929:                 } else if (IS_DELIMITER(c)) {
	.loc 3 929 28 is_stmt 0 discriminator 3
	movq	-136(%rbp), %rax	# self, tmp685
	movl	188(%rax), %eax	# self_452(D)->delim_whitespace, _182
	testl	%eax, %eax	# _182
	je	.L249	#,
# pandas/_libs/src/parser/tokenizer.c:929:                 } else if (IS_DELIMITER(c)) {
	.loc 3 929 28 discriminator 4
	call	__ctype_b_loc@PLT	#
	movq	(%rax), %rdx	# *_183, _184
	movsbq	-49(%rbp), %rax	# c, _185
	addq	%rax, %rax	# _186
	addq	%rdx, %rax	# _184, _187
	movzwl	(%rax), %eax	# *_187, _188
	movzwl	%ax, %eax	# _188, _189
	andl	$1, %eax	#, _190
	testl	%eax, %eax	# _190
	je	.L249	#,
.L248:
# pandas/_libs/src/parser/tokenizer.c:930:                     if (self->delim_whitespace) {

and after:

.L243:
# pandas/_libs/src/parser/tokenizer.c:930:                 } else if (IS_DELIMITER(c)) {
	.loc 3 930 27 is_stmt 1
	movzbl	-57(%rbp), %eax	# c, tmp657
	cmpb	-41(%rbp), %al	# delimiter, tmp657
	je	.L244	#,
# pandas/_libs/src/parser/tokenizer.c:930:                 } else if (IS_DELIMITER(c)) {
	.loc 3 930 28 discriminator 1
	cmpl	$0, -40(%rbp)	#, delim_whitespace
	je	.L245	#,
# pandas/_libs/src/parser/tokenizer.c:930:                 } else if (IS_DELIMITER(c)) {
	.loc 3 930 28 is_stmt 0 discriminator 2
	call	__ctype_b_loc@PLT	#
	movq	(%rax), %rdx	# *_171, _172
	movsbq	-57(%rbp), %rax	# c, _173
	addq	%rax, %rax	# _174
	addq	%rdx, %rax	# _172, _175
	movzwl	(%rax), %eax	# *_175, _176
	movzwl	%ax, %eax	# _176, _177
	andl	$1, %eax	#, _178
	testl	%eax, %eax	# _178
	je	.L245	#,
.L244:
# pandas/_libs/src/parser/tokenizer.c:931:                     if (self->delim_whitespace) {

Very low level...but this is executed for potentially every character in a file
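
Read back into C, the two listings correspond roughly to the sketch below. It is reconstructed from the assembly comments only, not copied from tokenizer.c: `parser_sketch_t`, the `is_delimiter_*` functions and the `count_delims_*` helpers are hypothetical names, and the real code uses a macro rather than functions. The `andl $1` on the `__ctype_b_loc` table is assumed to be glibc's `isblank` test.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the parser struct: only the two fields named in
 * the assembly comments are modeled, and the layout is not pandas' own. */
typedef struct {
    char delimiter;
    int delim_whitespace;
} parser_sketch_t;

/* "Before" shape: every test re-reads both fields through the pointer. */
static int is_delimiter_before(const parser_sketch_t *self, char c) {
    return (!self->delim_whitespace && c == self->delimiter) ||
           (self->delim_whitespace && isblank((unsigned char)c));
}

/* "After" shape: the caller passes plain locals, so nothing is dereferenced
 * per character, and the "c == delimiter" comparison comes first. */
static int is_delimiter_after(char delimiter, int delim_whitespace, char c) {
    return (c == delimiter) ||
           (delim_whitespace && isblank((unsigned char)c));
}

/* Per-character loop using the "before" test. */
static size_t count_delims_before(const parser_sketch_t *self, const char *s) {
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        n += (size_t)is_delimiter_before(self, *s);
    }
    return n;
}

/* Same loop, with the fields copied into locals once before it starts. */
static size_t count_delims_after(const parser_sketch_t *self, const char *s) {
    const char delimiter = self->delimiter;
    const int delim_whitespace = self->delim_whitespace;
    size_t n = 0;
    for (; *s != '\0'; ++s) {
        n += (size_t)is_delimiter_after(delimiter, delim_whitespace, *s);
    }
    return n;
}
```

As the listings read, the two shapes agree when `delim_whitespace` is 0, but the "after" shape also matches the configured delimiter character when `delim_whitespace` is set, which the "before" shape does not.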

Can you add a whatsnew?

lithomas1 · 2023-04-13T23:09:20Z

doc/source/whatsnew/v2.1.0.rst

@@ -88,7 +88,7 @@ Other enhancements
 - :meth:`arrays.SparseArray.map` now supports ``na_action`` (:issue:`52096`).
 - Add dtype of categories to ``repr`` information of :class:`CategoricalDtype` (:issue:`52179`)
 - Adding ``engine_kwargs`` parameter to :meth:`DataFrame.read_excel` (:issue:`52214`)
-
+- Performance improvement in :func:`read_csv` (:issue:`52632`)


It's probably worth clarifying that this is for the C engine only.

@WillAyd

Thanks @WillAyd

WillAyd added 3 commits March 11, 2023 10:56

Try simplifying delim macro

4b22dbc

no dereference impl

398a809

Merge branch 'main' into read-csv-perf

29ffe4c

Dr-Irv approved these changes Apr 12, 2023

mroeschke added the IO CSV (read_csv, to_csv) and Performance (Memory or execution speed performance) labels Apr 12, 2023

mroeschke added this to the 2.1 milestone Apr 12, 2023

WillAyd added 2 commits April 12, 2023 15:25

whatsnew

1f677f5

Merge branch 'main' into read-csv-perf

3ddb6a3

lithomas1 reviewed Apr 13, 2023

Merge remote-tracking branch 'upstream/main' into read-csv-perf

13afee0

mroeschke approved these changes Apr 14, 2023

mroeschke merged commit ac15528 into pandas-dev:main Apr 14, 2023

WillAyd deleted the read-csv-perf branch April 17, 2023 17:44

rhshadrach mentioned this pull request Sep 1, 2023
BUG: pandas.read_csv reads in a comma separated file when delim #54918
Closed
3 tasks
