Too Large cluster size #40

Open
apratim1 opened this issue Sep 20, 2017 · 5 comments

@apratim1

[Image: screen shot 2017-09-20 at 3 43 05 pm]

Selecting Edit Distance with a maximum of 2 produces an overly large cluster (see screenshot).

@hangal
Collaborator

hangal commented Sep 20, 2017

Unfortunately, we are computing the transitive closure of all strings within edit distance 2, so a single cluster can contain very unrelated names. Any ideas on how to resolve this?
Would splitting by constituency help?
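
To illustrate the chaining effect, here is a minimal sketch, not Surf's actual code. The function names and the sample names are made up for illustration. Each neighbouring pair of names is within edit distance 2, so the closure puts all of them in one cluster, even though the first and last names are 3 edits apart.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def closure_clusters(names, max_dist=2):
    """Group names by the transitive closure of 'edit distance <= max_dist'."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if edit_distance(a, b) <= max_dist:
            parent[find(a)] = find(b)  # union the two components

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


# Hypothetical names: every neighbouring pair differs by a single edit.
names = ["RAM LAL", "RAM PAL", "RAJ PAL", "RAJ PAUL"]
clusters = closure_clusters(names, max_dist=2)
print(len(clusters))                           # one cluster for all four names
print(edit_distance("RAM LAL", "RAJ PAUL"))    # yet the ends are 3 edits apart
```

With thousands of real names, long chains like this are likely, which would explain the oversized clusters.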

Sudx-old-gamer-new-coder changed the title from "Bug" to "Too Large cluster size" on Sep 20, 2017
@Sudx-old-gamer-new-coder
Contributor

@hangal Splitting by constituency may work for small states, but it won't help much for large ones.

I'm not sure whether this is doable with the current implementation, but I do have a suggestion.

If a cluster grows beyond a threshold that we choose, we can find and remove the keywords common to the names in that group (for example, 'Patel'), and then recursively create sub-clusters within that cluster.
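
A rough sketch of this idea, under assumptions: the names `split_cluster`, `max_size` and `min_share` are hypothetical, "common" is taken to mean "appears in at least half of the names in the cluster", and the re-clustering step is a plain edit-distance closure rather than whatever Surf does internally. The sample names are made up.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a, b):
    """Levenshtein distance (two-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def closure_clusters(keys, max_dist=2):
    """Connected components (as index lists) of 'edit distance <= max_dist'."""
    parent = list(range(len(keys)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(keys)), 2):
        if edit_distance(keys[i], keys[j]) <= max_dist:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(keys)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def split_cluster(names, max_size=3, min_share=0.5, max_dist=2):
    """Recursively split an oversized cluster after dropping its common tokens."""
    if len(names) <= max_size:
        return [names]
    # Count in how many names each token occurs (once per name).
    counts = Counter(tok for n in names for tok in set(n.split()))
    common = {t for t, c in counts.items() if c / len(names) >= min_share}
    if not common:
        return [names]
    # Re-cluster on the names with the common tokens removed.
    keys = [" ".join(t for t in n.split() if t not in common) for n in names]
    parts = closure_clusters(keys, max_dist)
    if len(parts) == 1:
        return [names]  # stripping separated nothing, so stop here
    result = []
    for part in parts:
        sub = [names[i] for i in part]
        result.extend(split_cluster(sub, max_size, min_share, max_dist))
    return result


# Hypothetical oversized cluster dominated by the token 'PATEL'.
cluster = ["ANIL PATEL", "ANIL K PATEL", "BHARAT PATEL",
           "BHARATI PATEL", "CHANDRAKANT PATEL"]
for sub in split_cluster(cluster):
    print(sub)
```

The recursion stops when a cluster is small enough, has no common token, or cannot be separated any further, so it always terminates. The thresholds would need tuning on real data.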

@hangal
Collaborator

hangal commented Sep 20, 2017

That's not a bad idea.

@hangal hangal self-assigned this Oct 7, 2017
@hangal
Collaborator

hangal commented Oct 26, 2017

@sudesh-ashoka, do the new controls for the Compatible Names algorithm address this problem?
For example, what happens if you set min token overlap = 3?
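
For reference, a small sketch of what a minimum-token-overlap check could look like. This is only an assumption about the setting's meaning (two names count as compatible when they share at least `min_overlap` whole tokens); the thread does not show Surf's actual rule, and `token_overlap` and `compatible` are hypothetical names. The sample names are ones quoted in this thread.

```python
def token_overlap(a: str, b: str) -> int:
    """Number of distinct whitespace-separated tokens the two names share."""
    return len(set(a.split()) & set(b.split()))


def compatible(a: str, b: str, min_overlap: int = 3) -> bool:
    """Assumed rule: names are compatible if they share enough exact tokens."""
    return token_overlap(a, b) >= min_overlap


pairs = [
    ("BEERU RAM", "BEERU RAM KISHORE"),
    ("BEERU RAM KISHORE", "BEERU RAM KISORE"),
]
for a, b in pairs:
    print(a, "|", b, "| shared tokens:", token_overlap(a, b),
          "| compatible at 2:", compatible(a, b, min_overlap=2),
          "| compatible at 3:", compatible(a, b, min_overlap=3))
```

Under this assumed rule, a threshold of 3 would also separate spelling variants such as KISHORE and KISORE, because exact token matching counts them as different tokens.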

@Sudx-old-gamer-new-coder
Contributor

hp_worksheet.csv.zip
@hangal I tested the new build of Surf and experimented with the Compatible Names settings, but I wasn't able to reduce the group size substantially.
I used the Himachal AE data (attached).

These are the names I looked for:

BEERU RAM |   |   |   | 2 | M | 23 | 0 | 9 | GEHARWIN | 3 | INC | 1998

BEERU RAM KISHORE | SC |   |   | 2 | M | 23 | 0 | 11 | GEHARWIN | 3 | INC | 2007
BEERU RAM KISHORE |   |   |   | 1 | M | 23 | 0 | 8 | GEHARWIN | 3 | INC | 1993
BEERU RAM KISORE |   |   |   | 2 | M | 23 | 0 | 7 | GEHARWIN | 3 | INC | 1990
