Conversation

malfet

I.e. it feels reasonable to always call `at::cuda::gemm` rather than `at::cuda::bgemm` when num_batches == 1.
After the change, benchmark results for torch built with CUDA-12, collected with the [following perf script](https://gist.github.com/malfet/6a17156d7f5663b8b12054a1beff3fe1) on an A100, are as follows:

|      Shape     |  bmm_time |  mm_time  | slow down (%) |
| -------------- | --------- | --------- | ------------- |
|    1x1x4096    |   14.18   |   14.31   |     -0.89     |
|    1x1x8192    |   14.37   |   14.37   |     -0.05     |
|   1x1x16384    |   14.03   |   14.12   |     -0.68     |
|   1x1x32768    |   14.19   |   14.24   |     -0.35     |
|   1x1x65536    |   14.85   |   14.52   |     2.30      |
|   1x1x131072   |   14.03   |   14.07   |     -0.33     |
|  128x128x128   |   11.34   |   11.06   |     2.56      |
|  256x256x256   |   14.85   |   14.40   |     3.15      |
|  512x512x512   |   27.22   |   27.22   |     -0.01     |
| 1024x1024x1024 |  129.66   |  129.50   |     0.12      |
| 2048x2048x2048 |  972.18   |  973.24   |     -0.11     |
|  129x127x129   |   11.21   |   11.25   |     -0.39     |
|  257x255x257   |   14.50   |   14.43   |     0.44      |
|  513x511x513   |   29.01   |   29.01   |     0.01      |
| 1025x1023x1025 |  137.65   |  137.64   |     0.01      |
| 2049x2047x2049 |  982.58   |  982.65   |     -0.01     |
|  4097x3x4097   |   86.65   |   86.64   |     0.01      |
|  8193x3x8193   |  384.02   |  383.96   |     0.02      |
| 16385x3x16385  |  1106.73  |  1107.32  |     -0.05     |
| 32769x3x32769  |  4739.49  |  4739.48  |     0.00      |
| 65537x3x65537  | 17377.78  | 17378.74  |     -0.01     |
|  4097x5x4097   |   87.09   |   87.12   |     -0.03     |
|  8193x5x8193   |  301.38   |  301.36   |     0.01      |
| 16385x5x16385  |  1107.38  |  1108.04  |     -0.06     |
| 32769x5x32769  |  4743.73  |  4744.07  |     -0.01     |
| 65537x5x65537  | 17392.32  | 17395.42  |     -0.02     |
|  4097x7x4097   |   87.17   |   87.19   |     -0.02     |
|  8193x7x8193   |  301.94   |  302.00   |     -0.02     |
| 16385x7x16385  |  1107.17  |  1106.79  |     0.03      |
| 32769x7x32769  |  4747.15  |  4747.13  |     0.00      |
| 65537x7x65537  | 17403.85  | 17405.02  |     -0.01     |

Fixes the perf problem reported in #114911.
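
For intuition, here is a minimal sketch (not the linked gist) of the equivalence the change relies on, plus a rough timing comparison; the shape and iteration count below are illustrative assumptions, not the values used in the PR benchmark:

```python
import torch

def time_op(fn, iters=100):
    # Rough CUDA timing via events; assumes a CUDA device is available.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn()  # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters * 1e3  # microseconds per call

if torch.cuda.is_available():
    a = torch.randn(1, 1024, 1024, device="cuda")
    b = torch.randn(1, 1024, 1024, device="cuda")
    # A batch-of-1 bmm is just an mm on the squeezed operands.
    diff = (torch.bmm(a, b)[0] - torch.mm(a[0], b[0])).abs().max()
    print(f"max abs diff: {diff.item():.2e}")
    print(f"bmm: {time_op(lambda: torch.bmm(a, b)):.2f} us")
    print(f" mm: {time_op(lambda: torch.mm(a[0], b[0])):.2f} us")
```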

malfet requested a review from ngimel December 1, 2023 23:12

malfet added the `release notes: cuda` and `topic: performance` labels Dec 1, 2023
ptrblck requested a review from eqy December 1, 2023 23:14


Einsum failure looks like it could be real; I'm guessing it could be because `self` here might not have a batch dimension? (e.g., `torch.baddbmm(torch.randn(32, 32), torch.randn(1, 32, 32), torch.randn(1, 32, 32))`)

EDIT: Not sure if that's the issue as squeeze should be a no-op in that case anyway...
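
For reference, a small sketch of the broadcasting case from the example above; it runs on CPU purely to show the shape semantics, and is not part of the failing test:

```python
import torch

# `self` is 2-D while the batch operands are 3-D, so it broadcasts
# across the batch dimension of batch1 @ batch2.
self_2d = torch.randn(32, 32)
b1 = torch.randn(1, 32, 32)
b2 = torch.randn(1, 32, 32)

out = torch.baddbmm(self_2d, b1, b2)
# Same computation with the batch dimension made explicit:
ref = torch.baddbmm(self_2d.unsqueeze(0), b1, b2)
print(out.shape, torch.allclose(out, ref))  # torch.Size([1, 32, 32]) True
```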

@malfet

Yes, the test failures look related, but I cannot reproduce them locally.
And `torch.randn(32, 32).squeeze(0)` is a no-op, but `torch.rand(1, 32).squeeze(0)` is not, so...
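
A quick sketch of the distinction, with shapes taken from the examples above:

```python
# squeeze(0) only removes the leading dimension when its size is 1.
import torch

print(torch.randn(32, 32).squeeze(0).shape)  # torch.Size([32, 32]) -- unchanged
print(torch.rand(1, 32).squeeze(0).shape)    # torch.Size([32])     -- dim removed
```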

malfet force-pushed the malfet/cuda-fall-back-to-addm branch from 83c4c05 to 42ada62 on December 5, 2023 04:05
malfet requested a review from mruberry as a code owner December 5, 2023 04:05
@malfet

> Einsum failure looks like it could be real

After some debugging, it is indeed related to this PR, but in a weird way: bfloat16 is not supported by `bmm` on sm52, but `mm` works fine, so backward now works on older GPUs...

Tweaked the OpInfo limits a bit.
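
A hedged probe of that behavior (shapes are illustrative; on an sm52-class GPU the expectation described above is that `bmm` raises while `mm` succeeds):

```python
import torch

if torch.cuda.is_available():
    cap = torch.cuda.get_device_capability()
    for name, fn in [
        ("mm",  lambda: torch.mm(torch.randn(8, 8, device="cuda", dtype=torch.bfloat16),
                                 torch.randn(8, 8, device="cuda", dtype=torch.bfloat16))),
        ("bmm", lambda: torch.bmm(torch.randn(2, 8, 8, device="cuda", dtype=torch.bfloat16),
                                  torch.randn(2, 8, 8, device="cuda", dtype=torch.bfloat16))),
    ]:
        try:
            fn()
            print(f"sm{cap[0]}{cap[1]}: bfloat16 {name} OK")
        except RuntimeError as e:
            print(f"sm{cap[0]}{cap[1]}: bfloat16 {name} unsupported: {e}")
```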

eqy approved these changes Dec 6, 2023
@malfet

@pytorchbot merge

pytorch-bot added the `ciflow/trunk` label Dec 8, 2023
@pytorchmergebot

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

malfet deleted the malfet/cuda-fall-back-to-addm branch December 18, 2023 19:03
dmenig pushed a commit to dmenig/pytorch that referenced this pull request Dec 21, 2023
atalman pushed a commit to atalman/pytorch that referenced this pull request Dec 28, 2023
atalman added a commit that referenced this pull request Jan 2, 2024