Too Large cluster size #40
Comments
Unfortunately, we are computing the transitive closure of all strings within edit distance 2, so a single cluster contains very unrelated names. Any ideas on how to resolve this?
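To make the chaining effect concrete, here is a minimal Python sketch. It is not the project's code: the names are made up, and `edit_distance` and `closure_clusters` are hypothetical helpers written only for this illustration. Each neighbouring pair is within edit distance 2, so the closure joins the two ends of the chain even though they are 4 edits apart.

```python
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, computed with the usual two-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def closure_clusters(names, max_dist=2):
    """Union-find over all pairs within max_dist (the transitive closure)."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if edit_distance(a, b) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values(), key=len, reverse=True)


# Made-up names: RAM -> RAMAN -> RAMANIK is a chain of distance-2 steps.
names = ["RAM", "RAMAN", "RAMANIK", "PATEL"]
clusters = closure_clusters(names)
print(clusters)
print(edit_distance("RAM", "RAMANIK"))
```

With longer chains the same effect lets one cluster absorb names that have almost nothing in common.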
@hangal Splitting by constituency may work for small states but is not useful for large ones. I am not sure whether it's doable with the current implementation, but I do have a suggestion: if a cluster's size exceeds a threshold that we decide on, we can find and remove the keywords common to the names in that cluster ('Patel', for example) and then recursively create sub-clusters within it.
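A rough Python sketch of this suggestion follows; it is one possible reading, not the current implementation. Everything named here is an assumption: the function names, the `max_size`, `max_ratio`, and `min_share` values, and the sample names. One further assumption needs flagging. Removing an identical token leaves the absolute edit distance between two names essentially unchanged, so the sketch re-clusters the stripped names with a length-normalised cutoff, where dropping the shared token does make the remaining differences count for more.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance, computed with the usual two-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def closure(names, key, max_ratio):
    """Group names whose keys are close, taking the transitive closure."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        a, b = key[names[i]], key[names[j]]
        # Assumed cutoff: distance relative to the longer key.
        if edit_distance(a, b) <= max_ratio * max(len(a), len(b)):
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


def split_cluster(names, max_size=2, max_ratio=0.3, min_share=0.5):
    """Recursively split a cluster that is larger than max_size."""
    if len(names) <= max_size:
        return [names]
    # Count each token once per name; "common" = present in most names.
    counts = Counter(tok for n in names for tok in set(n.split()))
    common = {tok for tok, c in counts.items() if c > min_share * len(names)}
    if not common:
        return [names]
    # Key = name without the common tokens (fall back to the full name).
    key = {n: " ".join(t for t in n.split() if t not in common) or n
           for n in names}
    parts = closure(names, key, max_ratio)
    if len(parts) == 1:  # nothing separated, stop recursing
        return [names]
    return [sub for part in parts
            for sub in split_cluster(part, max_size, max_ratio, min_share)]


# Made-up cluster: every pair of neighbours is within edit distance 2.
big = ["RAMESH PATEL", "RAMES PATEL", "MAHESH PATEL", "MAHES PATEL"]
parts = split_cluster(big)
print(parts)
```

In this toy input, 'PATEL' is removed and the remaining first names separate into a RAMESH group and a MAHESH group. Whether a normalised cutoff suits the real data would need checking against the worksheet.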
That's not a bad idea.
@sudesh-ashoka, do the new controls for the compat. alg. address this problem?
hp_worksheet.csv.zip
Following are the names I looked for.
BEERU RAM | | | | 2 | M | 23 | 0 | 9 | GEHARWIN | 3 | INC | 1998
BEERU RAM KISHORE | SC | | | 2 | M | 23 | 0 | 11 | GEHARWIN | 3 | INC | 2007
Selecting Edit Distance with a maximum of 2 gives us this bug.