At its heart, Code Climate's duplication engine uses a fairly simple algorithm to decide which parts of your code are duplicated. First, source files are parsed into abstract syntax trees (ASTs). ASTs are made up of nodes: each node has a type (which might be something like "variable assignment" or "if statement") and a set of values (which might be a variable name or another AST node representing a sub-expression). Each node is assigned a numerical mass, calculated as the number of sub-nodes plus 1.
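To make the mass calculation concrete, here is a minimal sketch in Python. The `Node` class and `mass` function are hypothetical names chosen for illustration, not the engine's actual code, and the sketch assumes that "sub-nodes" means every nested node rather than only direct children.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node: a type plus a list of values.

    Each value is either a plain string (such as a variable name)
    or another Node representing a sub-expression.
    """
    type: str
    values: list = field(default_factory=list)


def mass(node):
    """Return 1 for the node itself plus the mass of every nested sub-node.

    Assumption: sub-nodes are counted recursively, not just direct children.
    """
    return 1 + sum(mass(v) for v in node.values if isinstance(v, Node))


# AST for something like `x = 1 + 2`:
# one assignment node, one addition node, two integer literal nodes.
tree = Node("assign", ["x", Node("add", [Node("int", ["1"]), Node("int", ["2"])])])
print(mass(tree))  # 4
```

Under this reading, a leaf node such as an integer literal has a mass of 1, and larger expressions accumulate mass as they nest.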
When looking for duplication, each node in each AST is compared to every other node in the parsed ASTs. Two nodes have the same structure if they have the same type and all of their sub-nodes also have the same structure. Nodes that have the same structure but different values (like two "variable assignment" nodes assigning variables with different names and types) are considered "similar". Nodes that have the same structure and also have the same values are considered "identical".
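The comparison rules can be sketched as follows. This is an illustrative Python version under stated assumptions, not the engine's implementation: `Node`, `same_structure`, and `classify` are hypothetical names, and the sketch assumes sub-nodes are compared pairwise in order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical AST node; values are strings or nested Nodes."""
    type: str
    values: list = field(default_factory=list)


def same_structure(a, b):
    """True if both nodes share a type and their sub-nodes match pairwise."""
    if a.type != b.type:
        return False
    subs_a = [v for v in a.values if isinstance(v, Node)]
    subs_b = [v for v in b.values if isinstance(v, Node)]
    return len(subs_a) == len(subs_b) and all(
        same_structure(x, y) for x, y in zip(subs_a, subs_b)
    )


def classify(a, b):
    """Return "identical", "similar", or None when the structures differ."""
    if not same_structure(a, b):
        return None
    # Dataclass equality compares type and values recursively.
    return "identical" if a == b else "similar"


first = Node("assign", ["x", Node("int", ["1"])])
renamed = Node("assign", ["y", Node("int", ["2"])])
copy = Node("assign", ["x", Node("int", ["1"])])
other = Node("if", [Node("int", ["1"])])

print(classify(first, renamed))  # similar
print(classify(first, copy))     # identical
print(classify(first, other))    # None
```

Here `first` and `renamed` share a shape but differ in the variable name and literal, so they count as similar, while `first` and `copy` match in every value and count as identical.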
The duplication engine has a mass threshold for each language it analyzes (this mass threshold can also be customized in your .codeclimate.yml). When the duplication engine finds two nodes that are "similar" or "identical" as described above, and the nodes in question have a mass that is equal to or greater than the configured mass threshold, the engine emits an issue describing those nodes as being duplicates of each other (either "similar" or "identical" duplicates).
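Putting the pieces together, the sketch below shows one way the threshold check could work. It is a simplified illustration, not the engine's real code: every name (`Node`, `mass`, `walk`, `find_duplicates`) and the threshold values are hypothetical, mass is assumed to count all nested sub-nodes, and the naive pairwise loop makes no attempt at the optimizations a production engine would need.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    """Hypothetical AST node; values are strings or nested Nodes."""
    type: str
    values: list = field(default_factory=list)


def children(node):
    """Return only the values that are themselves nodes."""
    return [v for v in node.values if isinstance(v, Node)]


def mass(node):
    """1 for the node itself plus the mass of every nested sub-node."""
    return 1 + sum(mass(c) for c in children(node))


def walk(node):
    """Yield the node and then all of its descendants."""
    yield node
    for child in children(node):
        yield from walk(child)


def same_structure(a, b):
    """True if both nodes share a type and their sub-nodes match pairwise."""
    ca, cb = children(a), children(b)
    return (
        a.type == b.type
        and len(ca) == len(cb)
        and all(same_structure(x, y) for x, y in zip(ca, cb))
    )


def find_duplicates(trees, threshold):
    """Compare every sufficiently massive node against every other one.

    Returns a list of (kind, node_a, node_b) tuples, where kind is
    "identical" or "similar".
    """
    candidates = [n for t in trees for n in walk(t) if mass(n) >= threshold]
    issues = []
    for a, b in combinations(candidates, 2):
        if same_structure(a, b):
            issues.append(("identical" if a == b else "similar", a, b))
    return issues


# Two assignments with the same shape but different names; each has mass 4.
t1 = Node("assign", ["total", Node("add", [Node("var", ["a"]), Node("var", ["b"])])])
t2 = Node("assign", ["sum", Node("add", [Node("var", ["x"]), Node("var", ["y"])])])

print(len(find_duplicates([t1, t2], threshold=4)))  # 1
print(len(find_duplicates([t1, t2], threshold=5)))  # 0
```

With a threshold of 4, only the two top-level assignments are massive enough to be compared, and they are reported as similar. Raising the threshold to 5 filters both out, which is the effect a higher configured mass threshold has: smaller fragments stop being reported.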
Code is identical when all operations and values are identical.
Code is similar when the overall structure is the same, but the particular operations and values under consideration might be different.