Conversation

cachemeifyoucan

Add a new API to ObjectStore that can import a cas tree from another CAS. The two ObjectStores don't have to share the same hashing algorithm since all the objects will be rehashed and inserted into the new database.

if (this == &Upstream)
return Other;

// FIXME: This replicates the logic in `OnDiskGraphDB::importFullTree`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any intention to unify these in the future? I'm not clear if we need importFullTree or it could just use this function instead.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see OnDiskGraphDB::importFullTree can do better when importing, knowing the upstream CAS has the exact same hash and schema. The answer is maybe but maybe not if we want full optimization.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said. I need to update the comments anyway since I copy them from the other function and some do not apply here. I updated FIXME to be more of an explanation instead.

Add a new API to ObjectStore that can import a cas tree from another
CAS. The two ObjectStores don't have to share the same hashing
algorithm since all the objects will be rehashed and inserted into the
new database.
@cachemeifyoucancachemeifyoucan force-pushed the eng/PR-cas-import-tree branch from 487eced to 9292e05 Compare June 13, 2025 20:40

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably add a unit test that imports between two CAS with different hash schema

@@ -138,6 +141,8 @@ int main(int Argc, char **Argv) {
clEnumValN(MergeTrees, "merge", "merge paths/cas-ids"),
clEnumValN(GetCASIDForFile, "get-cas-id", "get cas id for file"),
clEnumValN(Import, "import", "import objects from another CAS"),
clEnumValN(ImportFromUpstream, "import-from-upstream",

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should replace the implementation of --import instead? The existing --import relies on the hash being the same (for an early out check on existing objects) but doesn't actually validate they are the same hash. It's not clear we need both of these APIs to be available on llvm-cas

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought --import is a test option for chained database to exercise that code path so I should probably not replace it.

I can probably take the name from it since I will expect simple name like --import is better used to exercise public CAS APIs. I can maybe move the current import option into llvm-cas-test utility.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current --import does not use chaining, does it? Seems like it works the same as this new functionality except it is recursive. It's not clear we need to keep it.

// Check if the node exists already.
auto CurrentID = Cur.Refs.front();
Cur.Refs.pop_front();
auto Ref = getReference(Upstream.getID(CurrentID));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream.getID(CurrentID) could be a different hash schema than this->getReference. While a collision is probably unlikely in practice, it could cause issues if the hash size differs. We could possibly keep this fast path if we add a check that the CASContext matches?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. I have tested importing a different casID schema but not with a different hashing algorithm. I think that is doable (I can switch SHA1 for the testing plugin dylib implementation in LLVM).

We can also go the other way that we only support hashing with the same schema. I was ambitiously thinking I can even change CAS schema when creating a reproducer when CAS is used but that is actually not possible in current state so I can dial back this implementation. The benefit of same schema import is that it can be more efficient.

Sign up for free to join this conversation on . Already have an account? Sign in to comment
None yet
None yet

Successfully merging this pull request may close these issues.