Documentation
Overview
The identify_license program tries to identify the license type of an unknown license. The files containing the license text are specified on the command line, so multiple license files can be analyzed with a single command. For each file, the type of the license is returned along with the confidence level of the match. The confidence level is between 0.0 and 1.0, with 1.0 indicating an exact match and 0.0 indicating a complete mismatch. The results are sorted by confidence level, highest first.
    $ identify_license LICENSE1 LICENSE2
    LICENSE2: MIT (confidence: 0.987)
    LICENSE1: BSD-2-Clause (confidence: 0.833)
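The ordering and output format shown above can be sketched in a few lines of Go. This is only an illustration: the `match` type and the `sortByConfidence` and `formatMatch` helpers are hypothetical names invented for this example, not part of the tool's actual code.

```go
package main

import (
	"fmt"
	"sort"
)

// match is a hypothetical result record; the real tool's types may differ.
type match struct {
	File       string
	License    string
	Confidence float64 // 0.0 (complete mismatch) to 1.0 (exact match)
}

// sortByConfidence orders matches from highest to lowest confidence.
func sortByConfidence(ms []match) {
	sort.SliceStable(ms, func(i, j int) bool {
		return ms[i].Confidence > ms[j].Confidence
	})
}

// formatMatch renders a match in the style of the sample output.
func formatMatch(m match) string {
	return fmt.Sprintf("%s: %s (confidence: %.3f)", m.File, m.License, m.Confidence)
}

func main() {
	ms := []match{
		{"LICENSE1", "BSD-2-Clause", 0.833},
		{"LICENSE2", "MIT", 0.987},
	}
	sortByConfidence(ms)
	for _, m := range ms {
		fmt.Println(formatMatch(m))
	}
}
```

A stable sort is used here so that files with equal confidence keep their command-line order; that tie-breaking rule is an assumption of the sketch.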
The license_serializer program normalizes and serializes the known licenseclassifier licenses into a compressed archive. The hash values for the licenses are calculated and added to the archive. These hashes can then be used to find offsets in an unknown text that are good starting points for the Levenshtein Distance algorithm.
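To illustrate how stored hashes might point at a promising offset, here is a minimal sketch. It assumes a simplified normalization (lowercasing and whitespace splitting), an FNV-1a hash, and a fixed window of three words; all of these choices, along with the names `normalize`, `hashWords`, `candidateOffsets`, and `windowSize`, are assumptions made for the example and do not describe the archive format or the hashing the tool really uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strings"
)

// windowSize is an assumed number of words hashed together.
const windowSize = 3

// normalize lowercases the text and splits it into words.
// Hypothetical: real license normalization is likely more involved.
func normalize(text string) []string {
	return strings.Fields(strings.ToLower(text))
}

// hashWords returns an FNV-1a hash of the words joined by single spaces.
func hashWords(words []string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(strings.Join(words, " ")))
	return h.Sum32()
}

// candidateOffsets returns the word offsets in unknown at which a window
// hashes to the same value as the opening window of the known license.
func candidateOffsets(known, unknown string) []int {
	k, u := normalize(known), normalize(unknown)
	if len(k) < windowSize {
		return nil
	}
	target := hashWords(k[:windowSize])
	var offsets []int
	for i := 0; i+windowSize <= len(u); i++ {
		if hashWords(u[i:i+windowSize]) == target {
			offsets = append(offsets, i)
		}
	}
	return offsets
}

func main() {
	known := "Permission is hereby granted, free of charge"
	unknown := "Copyright 2017 Example Author\n\nPermission is hereby granted, free of charge"
	fmt.Println(candidateOffsets(known, unknown))
}
```

Each offset returned this way is only a candidate. The more expensive Levenshtein comparison would still be run from that position to compute the actual confidence.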
The license_word_count program counts the frequency of words as they appear in the known licenses. This information is useful if we want to be more selective about which files we run through the license classifier.
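The counting step can be sketched as follows. The helper names `wordCounts` and `topWords` are hypothetical, and splitting on whitespace after lowercasing is an assumed tokenization that may differ from what the tool does.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// wordCounts tallies how often each lowercased word appears across all texts.
func wordCounts(texts []string) map[string]int {
	counts := make(map[string]int)
	for _, t := range texts {
		for _, w := range strings.Fields(strings.ToLower(t)) {
			counts[w]++
		}
	}
	return counts
}

// topWords returns up to n words ordered by descending count,
// with ties broken alphabetically so the result is deterministic.
func topWords(counts map[string]int, n int) []string {
	words := make([]string, 0, len(counts))
	for w := range counts {
		words = append(words, w)
	}
	sort.Slice(words, func(i, j int) bool {
		ci, cj := counts[words[i]], counts[words[j]]
		if ci != cj {
			return ci > cj
		}
		return words[i] < words[j]
	})
	if n > len(words) {
		n = len(words)
	}
	return words[:n]
}

func main() {
	// Placeholder snippets standing in for the known license texts.
	licenses := []string{
		"permission is hereby granted free of charge",
		"redistribution and use in source and binary forms",
	}
	counts := wordCounts(licenses)
	for _, w := range topWords(counts, 3) {
		fmt.Printf("%s: %d\n", w, counts[w])
	}
}
```

One possible use of such a table, offered only as a guess at the intent, is to skip files that contain almost none of the frequent license words before running the full classifier on them.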