Too Large cluster size #40

apratim1 · 2017-09-20T10:18:17Z

Selecting Edit Distance with a maximum distance of 2 triggers this bug: the resulting clusters are too large.

hangal · 2017-09-20T10:34:33Z

Unfortunately, we are computing the transitive closure of all strings within edit distance 2 of each other, so a single cluster can end up containing very unrelated names. Any ideas on how to resolve this?
Would splitting by constituency help?
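
To illustrate why the transitive closure chains unrelated names together, here is a minimal sketch. It is not Surf's actual code: the helper names `edit_distance` and `closure_clusters` and the sample names are made up for illustration, and it assumes plain Levenshtein distance with a union-find merge.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def closure_clusters(names, max_dist=2):
    """Merge every pair within max_dist, then return the connected components."""
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        if edit_distance(a, b) <= max_dist:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return list(groups.values())


# Hypothetical names: each neighbouring pair is 2 edits apart,
# but the two ends of the chain are 4 edits apart.
names = ["RAM", "RAMAN", "RAMANAN"]
clusters = closure_clusters(names)
```

Here `RAM` and `RAMANAN` land in the same cluster even though they are 4 edits apart, because `RAMAN` bridges them. With thousands of names, such bridges are likely to be common.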

Sudx-old-gamer-new-coder · 2017-09-20T12:08:19Z

@hangal Splitting by constituency may work for small states, but it is not useful for large ones.

I am not sure whether this is doable with the current implementation, but I do have a suggestion.

If a cluster's size exceeds a threshold value (that we decide), we can try finding and then removing the common keywords in the group's names, for example 'Patel', and then recursively create sub-clusters within that cluster.
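
A rough sketch of how this suggestion could work, under several assumptions: the function names (`cluster_indices`, `split_cluster`), the `max_size` threshold, and the rule of dropping only the single most common token per pass are all hypothetical choices, and the sample names are invented. The input is treated as an already-formed oversized cluster.

```python
from collections import Counter
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cluster_indices(keys, max_dist):
    """Transitive-closure clustering; returns groups of indices into keys."""
    parent = list(range(len(keys)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(keys)), 2):
        if edit_distance(keys[i], keys[j]) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(keys)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def split_cluster(names, keys=None, max_size=3, max_dist=2):
    """Recursively split an oversized cluster by dropping its most common token."""
    keys = list(names) if keys is None else keys
    if len(names) <= max_size:
        return [list(names)]
    # Count each token once per name, then pick the most frequent one.
    counts = Counter(tok for k in keys for tok in set(k.split()))
    common = [tok for tok, c in counts.most_common(1) if c >= 2]
    if not common:
        return [list(names)]  # no shared token left to remove
    stripped = [" ".join(t for t in k.split() if t != common[0]) for k in keys]
    result = []
    for group in cluster_indices(stripped, max_dist):
        result.extend(split_cluster([names[i] for i in group],
                                    [stripped[i] for i in group],
                                    max_size, max_dist))
    return result


# Hypothetical oversized cluster sharing the token PATEL.
big = ["ANIL PATEL", "ANEEL PATEL", "VIJAY PATEL", "VIJAI PATEL"]
sub_clusters = split_cluster(big, max_size=3)
```

Each pass removes at least one token from the comparison keys, so the recursion terminates. The original names are kept alongside the stripped keys, so the sub-clusters still report full names.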

hangal · 2017-09-20T12:42:51Z

That's not a bad idea.

hangal · 2017-10-26T11:56:34Z

@sudesh-ashoka, do the new controls for the compatible-names algorithm address this problem?
Let's say you set min token overlap = 3?
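
For reference, a small sketch of what a minimum-token-overlap check might look like. This is an assumption about the setting's meaning (the number of whitespace-separated tokens two names share), not Surf's actual implementation, and `token_overlap` and `compatible` are hypothetical names.

```python
def token_overlap(a: str, b: str) -> int:
    """Number of distinct whitespace-separated tokens the two names share."""
    return len(set(a.split()) & set(b.split()))


def compatible(a: str, b: str, min_overlap: int = 3) -> bool:
    """Treat two names as compatible only if they share enough tokens."""
    return token_overlap(a, b) >= min_overlap


# Names taken from the Himachal data quoted later in this thread.
short_name = "BEERU RAM"
long_name = "BEERU RAM KISHORE"
shared = token_overlap(short_name, long_name)
```

Under this reading, a minimum overlap of 3 would keep `BEERU RAM` apart from `BEERU RAM KISHORE`, since they share only two tokens, and a misspelled token such as `KISORE` would not count as shared at all.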

Sudx-old-gamer-new-coder · 2017-10-26T14:50:00Z

hp_worksheet.csv.zip
@hangal I tested the new build of Surf and played around with the Compatible Names settings, but I wasn't able to reduce the group size substantially.
I used the Himachal AE data (attached above).

These are the names I looked for:

BEERU RAM | | | | 2 | M | 23 | 0 | 9 | GEHARWIN | 3 | INC | 1998

BEERU RAM KISHORE | SC | | | 2 | M | 23 | 0 | 11 | GEHARWIN | 3 | INC | 2007
BEERU RAM KISHORE | | | | 1 | M | 23 | 0 | 8 | GEHARWIN | 3 | INC | 1993
BEERU RAM KISORE | | | | 2 | M | 23 | 0 | 7 | GEHARWIN | 3 | INC | 1990

Sudx-old-gamer-new-coder changed the title ~~Bug~~ Too Large cluster size Sep 20, 2017

hangal self-assigned this Oct 7, 2017

hangal added the P1 - High Priority label Oct 7, 2017
