Code Climate's duplication engine, at its heart, uses a fairly simple algorithm to decide which parts of your code are duplicated. First, source files are parsed into abstract syntax trees (ASTs). ASTs are made up of nodes: each node has a type (such as "variable assignment" or "if statement") and a set of values (such as a variable name, or another AST node representing a sub-expression). Each node is assigned a numerical mass, calculated as the number of sub-nodes plus 1.
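To make the idea of mass concrete, here is a minimal sketch in Python. The `Node` class and its `mass` method are hypothetical names chosen for illustration, not the engine's actual code, and the sketch assumes that "sub-nodes" means every node nested anywhere beneath a node, not only its direct children.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Node:
    """Hypothetical AST node: a type plus a list of values.

    A value is either a plain leaf (a name, a literal) or another Node.
    """
    type: str
    values: List[Any] = field(default_factory=list)

    def mass(self) -> int:
        # 1 for this node, plus the mass of every nested Node.
        # Assumption: sub-nodes are counted recursively.
        return 1 + sum(v.mass() for v in self.values if isinstance(v, Node))


# AST for the statement `x = a + 1`
assign = Node("assign", ["x", Node("add", [Node("var", ["a"]), Node("int", [1])])])

print(assign.mass())  # 4: assign, add, var, int
```

Under this reading, a larger or more deeply nested piece of code always has a larger mass than any of its parts.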
When looking for duplication, each node in each AST is compared to every other node in the parsed ASTs. Two nodes have the same structure if they have the same type and all of their sub-nodes also have the same structure. Nodes that have the same structure but different values (like two "variable assignment" nodes assigning variables with different names and types) are considered "similar". Nodes that have the same structure and also the same values are considered "identical".
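The distinction between "similar" and "identical" can be sketched as follows. This is an illustrative Python sketch, not the engine's implementation: `Node`, `same_structure`, and `classify` are hypothetical names, and the sketch assumes that two nodes with the same structure must also hold the same number of values.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Node:
    """Hypothetical AST node: a type plus leaf values and/or child Nodes."""
    type: str
    values: List[Any] = field(default_factory=list)


def same_structure(a: Any, b: Any) -> bool:
    """True if a and b have the same shape, ignoring leaf values."""
    if isinstance(a, Node) != isinstance(b, Node):
        return False  # a leaf never matches a node
    if not isinstance(a, Node):
        return True   # two leaves: their values do not affect structure
    return (
        a.type == b.type
        and len(a.values) == len(b.values)  # assumption: same arity required
        and all(same_structure(x, y) for x, y in zip(a.values, b.values))
    )


def classify(a: Node, b: Node) -> Optional[str]:
    """Return "identical", "similar", or None when the structures differ."""
    if not same_structure(a, b):
        return None
    # Dataclass equality compares types and values recursively.
    return "identical" if a == b else "similar"


x_is_1 = Node("assign", ["x", Node("int", [1])])
y_is_2 = Node("assign", ["y", Node("int", [2])])

print(classify(x_is_1, y_is_2))  # similar
print(classify(x_is_1, Node("assign", ["x", Node("int", [1])])))  # identical
print(classify(x_is_1, Node("if", [Node("int", [1])])))  # None
```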
The duplication engine has a mass threshold for each language it analyzes (this mass threshold can also be customized in your .codeclimate.yml). When the duplication engine finds two nodes that are "similar" or "identical" as described above, and the nodes in question have a mass that is equal to or greater than the configured mass threshold, the engine emits an issue describing those nodes as being duplicates of each other (either "similar" or "identical" duplicates).
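Putting the pieces together, the comparison and threshold check might look like the sketch below. Everything here is a hypothetical illustration: the names `Node`, `walk`, and `find_duplicates` are invented, mass is assumed to count nested nodes recursively, and the threshold values are arbitrary rather than any language's real default. The sketch also reports every qualifying pair, including pairs nested inside a larger duplicate, which a real engine would likely consolidate.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Any, Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical AST node: a type plus leaf values and/or child Nodes."""
    type: str
    values: List[Any] = field(default_factory=list)

    def mass(self) -> int:
        # Assumption: mass counts this node plus all nested nodes.
        return 1 + sum(v.mass() for v in self.values if isinstance(v, Node))


def same_structure(a: Any, b: Any) -> bool:
    """True if a and b have the same shape, ignoring leaf values."""
    if isinstance(a, Node) != isinstance(b, Node):
        return False
    if not isinstance(a, Node):
        return True
    return (
        a.type == b.type
        and len(a.values) == len(b.values)
        and all(same_structure(x, y) for x, y in zip(a.values, b.values))
    )


def walk(node: Node) -> Iterator[Node]:
    """Yield a node and every node nested beneath it."""
    yield node
    for value in node.values:
        if isinstance(value, Node):
            yield from walk(value)


def find_duplicates(trees: List[Node], threshold: int) -> List[Tuple[str, Node, Node]]:
    """Compare every node with every other node; report pairs at or above the threshold."""
    # Nodes with the same structure have the same mass here, so filtering
    # each node individually is equivalent to checking both nodes of a pair.
    candidates = [n for tree in trees for n in walk(tree) if n.mass() >= threshold]
    issues = []
    for a, b in combinations(candidates, 2):
        if same_structure(a, b):
            issues.append(("identical" if a == b else "similar", a, b))
    return issues


tree_a = Node("assign", ["x", Node("add", [Node("var", ["a"]), Node("int", [1])])])
tree_b = Node("assign", ["y", Node("add", [Node("var", ["b"]), Node("int", [2])])])

for kind, a, b in find_duplicates([tree_a, tree_b], threshold=4):
    print(kind, a.type, b.type)  # similar assign assign
```

Raising the threshold makes the check less sensitive: with `threshold=5` the example above reports nothing, while `threshold=3` also reports the nested `add` expressions as a second similar pair.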
There are two Code Climate maintainability checks for duplication: identical-code and similar-code. Each has language-specific default thresholds. Setting a threshold for a check in your configuration overrides its default for all languages.
    version: "2" # required to adjust maintainability checks
    checks:
      similar-code:
        config:
          threshold: # overriding will affect all languages
      identical-code:
        config:
          threshold: # overriding will affect all languages
Learn more about configuring individual maintainability checks in our section on Advanced Configuration.