Compatible Name showing under different group #46

Open
Sudx-old-gamer-new-coder opened this issue Oct 25, 2017 · 4 comments
Comments

@Sudx-old-gamer-new-coder
Contributor

hp_worksheet.csv.zip

A name is showing under a different group when using compatible names in Himachal AE.
Please see the attached screenshots for details.

Also attaching the worksheet used.

[Screenshots: bishan-1, bishan-2]

@Sudx-old-gamer-new-coder
Contributor Author

The following name also didn't show up in the same group.

| Cand | Constituency | Party | Year | pid |
|---|---|---|---|---|
| DES RAJ | BANIKHET | INC | 1972 | 497 |
| DES RAJ | BANIKHET | INC | 1977 | 497 |
| DES RAJ MAHAJAN | BANIKHET | INC | 1982 | 1244 |

(The last row, pid 1244, didn't show up in the group.)

@hangal
Collaborator

hangal commented Oct 26, 2017

The compatible names algorithm has 2 important params:

  1. Min token overlap: if the 2 names have this many tokens in common, they are compatible. This is currently set to 3 by default (not 2). Will change it back to 2 soon, or make it configurable.

  2. IGNORE_TOKEN_THRESHOLD = 200
    Any token that occurs more than this number of times in the dataset is ignored for the purposes of the above comparison. This gets rid of noise due to extremely common names like Patel and Singh.
    To see which tokens were ignored, look in the log for a string like: Ignored tokens: SING RAM CHAND
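
To make the two params concrete, here is a minimal Python sketch of the comparison as described above. It is not the actual implementation: the function names `ignored_tokens` and `compatible` are made up for illustration, tokens are assumed to be whitespace-separated, and "this many tokens in common" is assumed to mean "at least this many".

```python
from collections import Counter

def ignored_tokens(names, ignore_token_threshold=200):
    """Tokens occurring more than the threshold across the dataset (hypothetical helper)."""
    counts = Counter(tok for name in names for tok in name.split())
    return {tok for tok, n in counts.items() if n > ignore_token_threshold}

def compatible(a, b, ignored, min_token_overlap=2):
    """Assumed rule: compatible if enough non-ignored tokens are shared."""
    shared = (set(a.split()) & set(b.split())) - ignored
    return len(shared) >= min_token_overlap

# Toy dataset built from the rows reported in this issue.
names = ["DES RAJ", "DES RAJ", "DES RAJ MAHAJAN"]
ignored = ignored_tokens(names)  # empty here: no token occurs more than 200 times

print(compatible("DES RAJ", "DES RAJ MAHAJAN", ignored, min_token_overlap=2))
print(compatible("DES RAJ", "DES RAJ MAHAJAN", ignored, min_token_overlap=3))
```

Under these assumptions the two names share only DES and RAJ, so they are compatible with an overlap of 2 but not with an overlap of 3. In the full dataset the result also depends on whether either token exceeds the ignore threshold.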

hangal pushed a commit that referenced this issue Oct 26, 2017
@hangal
Collaborator

hangal commented Oct 26, 2017

Sudesh, pls check now and close the issue if it is fixed.
The compatible names algorithm now has 2 new params that can be controlled by the user.
Pls check the descriptions and suggest improvements to the text if anything is unclear.
The defaults are 2 (min token overlap) and 200 (ignore token threshold).

image

With these defaults, your examples work fine:

image

image

@Sudx-old-gamer-new-coder
Contributor Author

I have been playing around with the Compatible Name settings, but wasn't able to reduce the size of the cluster that contains the following rows.
Could you suggest a setting that works for this?

BEERU RAM | | | | 2 | M | 23 | 0 | 9 | GEHARWIN | 3 | INC | 1998

BEERU RAM KISHORE | SC | | | 2 | M | 23 | 0 | 11 | GEHARWIN | 3 | INC | 2007
BEERU RAM KISHORE | | | | 1 | M | 23 | 0 | 8 | GEHARWIN | 3 | INC | 1993
BEERU RAM KISORE | | | | 2 | M | 23 | 0 | 7 | GEHARWIN | 3 | INC | 1990
