New Threading Mode

Urs Janßen

2005-06-20 12:14:13 UTC

--/9DWx/yDrRhgMJTb
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

[note: fullquote as original article was rejected (non-member
submission); updated patch at the end]

I have attached a patch which adds a new threading mode. I call this
threading mode "Percentage Match". It simply threads together all the
articles which have a subject which is similar enough to each other. The
level of similarity is decided based upon a character by character check
and is configurable.
This threading mode is useful in groups where the Multipart
threading mode doesn't work due to posters following a different naming
convention or where the articles obviously go together, as in image sets,
but have different filename embedded within the subject.
I find that this threading mode works quite well in large groups,
especially where the archive contains more than one file spread through
several multipart articles. The performance compared to Multipart
threading is significantly better. A realworld group containing 1.1
million articles was threaded down to approximately 1400 in 40 seconds on
an AMD Sempron 2400+ while the same group took well over 40 minutes via
Multipart threading.
As you will notice I placed this threading mode ahead of Multipart
threading to keep the threading modes in approximately increasing
processing requirements.

which is a bad idea as users have to adjust thier tinrc/attributes
file to get back thier thread_arts preferece (if they used
multipart/subject). I've placed the new threading method at the end
so numbers don't change.

If there are any questions or problems regarding this patch please
feel free to contact me.

as the threading method can be chosen on a per group basis via
attributes it IMHO would make sense to allow a per grou basis
percent limit for the new threading method.

here is an updated version which fixes the method numbering, does
some sanity checks and minor improvemts.

urs
--
"Only whimps use tape backup: _real_ men just upload their important stuff
on ftp, and let the rest of the world mirror it ;)" - Linus

--/9DWx/yDrRhgMJTb
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="thread_patch.n"

diff -Nur tin-1.7.9/doc/tin.1 tin-1.7.9.new/doc/tin.1
--- tin-1.7.9/doc/tin.1 2005-06-12 09:48:07.908128000 +0200
+++ tin-1.7.9.new/doc/tin.1 2005-06-20 11:47:36.435347018 +0200
@@ -2196,12 +2196,19 @@
\&''Subject:'' (default).
.IP 4
\fBMultipart Subject\fP, thread multipart articles on ''Subject:''.
+.IP 5
+\fBPercentage Match\fP, thread base upon a partial character match on
+\&''Subject:''.
.RE
.TP
.B Catchup thread by using left key (thread_catchup_on_exit)
If ON catchup group/thread when leaving with the left arrow key. Default is
ON.
.TP
+.B Matchingness of a thread (thread_perc)
+How closely the subjects must match for two threads to be considered
+part of the same thread. This is a percentage and the default if 75%.
+.TP
.B Score of a thread (thread_score)
How the total score of a thread is computed. Default is 0, the maximum
score in this thread.
diff -Nur tin-1.7.9/doc/tin.5 tin-1.7.9.new/doc/tin.5
--- tin-1.7.9/doc/tin.5 2005-06-07 23:59:41.000000000 +0200
+++ tin-1.7.9.new/doc/tin.5 2005-06-20 11:48:46.362522633 +0200
@@ -1929,10 +1929,16 @@
0) Don't thread, 1) Thread on Subject only 2) Thread on References only,
3) Thread on References then Subject (default)
4) Thread multipart articles on Subject.
+5) Thread on Percentage Match of the Subjects
It's also possible to set the threading type on a per group basis by setting
-the group attribute variable \fBthread_arts\fP to 0 - 4 in the file
+the group attribute variable \fBthread_arts\fP to 0 - 5 in the file
\fI${TIN_HOMEDIR\-"$HOME"}/.tin/attributes\fR.
.TP
+.B thread_perc
+Defines how close the subjects must match while threading by Percentage
+Match for threads to be considered part of a single thread. This value
+is in the range 0 to 100. The default is 75.
+.TP
.B thread_catchup_on_exit
If ON catchup group/thread when leaving with the left arrow key. Default is
ON.
diff -Nur tin-1.7.9/include/extern.h tin-1.7.9.new/include/extern.h
--- tin-1.7.9/include/extern.h 2005-06-07 23:59:41.000000000 +0200
+++ tin-1.7.9.new/include/extern.h 2005-06-20 11:34:16.762014849 +0200
@@ -1567,6 +1567,7 @@
extern struct opttxt txt_tab_goto_next_unread;
extern struct opttxt txt_tex2iso_conv;
extern struct opttxt txt_thread_articles;
+extern struct opttxt txt_thread_perc;
extern struct opttxt txt_thread_catchup_on_exit;
extern struct opttxt txt_thread_score;
extern struct opttxt txt_underscores_regex;
diff -Nur tin-1.7.9/include/tin.h tin-1.7.9.new/include/tin.h
--- tin-1.7.9/include/tin.h 2005-06-07 23:59:41.000000000 +0200
+++ tin-1.7.9.new/include/tin.h 2005-06-20 12:00:51.505482545 +0200
@@ -1136,8 +1136,11 @@
#define THREAD_REFS 2
#define THREAD_BOTH 3
#define THREAD_MULTI 4
+#define THREAD_PERC 5

-#define THREAD_MAX THREAD_MULTI
+#define THREAD_MAX THREAD_PERC
+
+#define THREAD_PERC_DEFAULT 75

/*
* Values for show_author
diff -Nur tin-1.7.9/include/tinrc.h tin-1.7.9.new/include/tinrc.h
--- tin-1.7.9/include/tinrc.h 2005-06-07 23:59:41.000000000 +0200
+++ tin-1.7.9.new/include/tinrc.h 2005-06-20 11:35:28.071938650 +0200
@@ -141,6 +141,7 @@
int sort_threads_type; /* method used to sort base[] */
int strip_bogus;
int thread_articles; /* threading system for viewing articles */
+ int thread_perc; /* how close the match needs to be for THREAD_PERC to recognize two articles as the same thread */
int thread_score; /* how the score for threads is computed*/
int wildcard; /* 0=wildmat, 1=regex */
int score_limit_kill; /* score limit to kill articles */
diff -Nur tin-1.7.9/src/art.c tin-1.7.9.new/src/art.c
--- tin-1.7.9/src/art.c 2005-06-17 16:08:14.528851604 +0200
+++ tin-1.7.9.new/src/art.c 2005-06-20 12:15:15.474961229 +0200
@@ -85,8 +85,9 @@
static t_bool parse_headers(FILE *fp, struct t_article *h);
static t_compfunc eval_sort_arts_func(unsigned int sort_art_type);
static void sort_base(unsigned int sort_threads_type);
-static void thread_by_subject(void);
static void thread_by_multipart(void);
+static void thread_by_percentage(void);
+static void thread_by_subject(void);

/*
@@ -720,6 +721,96 @@
#endif /* 0 */
}

+/*
+ * This Threading algorithm threads articles into 'buckets' where each bucket
+ * contains all the articles which match the root article's subject line to
+ * the configured percentage. Eg, if the root article had the subject "asdf"
+ * and the match percentage was configured to be 75% then any article would
+ * match if its subject was no different in more than a single character.
+ */
+static void
+thread_by_percentage(
+ void)
+{
+ int i, j, k;
+ int root_num = 0; /* The index number of the root we are currently working on. */
+ int unmatched; /* This is the number of characters that don't match between the two strings */
+ unsigned int percentage = 100 - tinrc.thread_perc;
+ int length_diff;
+
+ /* First we need to sort art[] to simplify and speed up the matching. */
+ SortBy(subj_comp_asc);
+
+ /*
+ * Now we put all the articles which match enough into the thread. If
+ * an article doesn't match enough we create a new thread and then add
+ * to that and so on.
+ */
+ base[0] = 0;
+ arts[0].prev = ART_NORMAL;
+ for_each_art(i) {
+ if (i == 0)
+ continue;
+
+ /* Check each character to see if it matched enough */
+ k = 0;
+ unmatched = 0;
+ for (j = 0; arts[base[root_num]].subject[j] != '\0' && arts[i].subject[k] != '\0'; j++) {
+ if (arts[base[root_num]].subject[j] == arts[i].subject[k]) {
+ /* The characters match up. So we move onto the next*/
+ k++;
+ continue;
+ }
+
+ /*
+ * So the characters didn't match up and a character
+ * wasn't inserted. So we'll just ignore these two
+ * characters and see if they were just differing, but
+ * don't through the alignment out. We need to keep
+ * track of this character as a difference.
+ */
+ k++;
+ unmatched++;
+ }
+
+ /*
+ * By getting here we have a number of unmatched characters
+ * between the two strings. We also have the length of the
+ * strings available to us easily.
+ * All we need to do is see if the match is good enough, but
+ * we count differences in the length of the strings against
+ * them matching.
+ */
+
+ length_diff = strlen(arts[base[root_num]].subject) - strlen(arts[i].subject);
+ /* ensure that it's positive */
+ if (length_diff < 0)
+ length_diff = -1 * length_diff;
+
+ unmatched += length_diff;
+ if ((unmatched * 100) / strlen(arts[base[root_num]].subject) > percentage) {
+ /*
+ * If there is less greater than percentage% different
+ * start a new thread.
+ */
+ root_num++;
+ base[root_num] = i;
+ arts[i].prev = ART_NORMAL;
+ continue;
+ } else {
+ /*
+ * The subject lines match enough to consider them part
+ * of a single thread, so add the current article to
+ * the thread.
+ */
+ if (arts[base[root_num]].thread < 0)
+ arts[base[root_num]].thread = i;
+ arts[i].prev = i - 1;
+ arts[i - 1].thread = i;
+ continue;
+ }
+ }
+}

/*
* This was brought over from tags.c, however this version doesn't not
@@ -927,6 +1018,7 @@
* THREAD_REFS Threads are created using the References headers
* THREAD_BOTH Threads created using References and then Subject
* THREAD_MULTI Threads created using Subject to search for Multiparts
+ * THREAD_PERC Threads based upon a char for char match of greater than x%
*
* .thread and .prev are used to hold the threading information, see tin.h for
* more information
@@ -1021,6 +1113,10 @@
thread_by_multipart();
break;

+ case THREAD_PERC:
+ thread_by_percentage();
+ break;
+
default: /* not reached */
break;
}
diff -Nur tin-1.7.9/src/config.c tin-1.7.9.new/src/config.c
--- tin-1.7.9/src/config.c 2005-06-10 22:53:02.508897000 +0200
+++ tin-1.7.9.new/src/config.c 2005-06-20 12:00:48.345062658 +0200
@@ -704,6 +704,9 @@
if (match_integer(buf, "thread_articles=", &tinrc.thread_articles, THREAD_MAX))
break;

+ if (match_integer(buf, "thread_perc=", &tinrc.thread_perc, 100))
+ break;
+
if (match_integer(buf, "thread_score=", &tinrc.thread_score, THREAD_SCORE_WEIGHT))
break;

@@ -925,6 +928,9 @@
fprintf(fp, _(txt_thread_articles.tinrc));
fprintf(fp, "thread_articles=%d\n\n", tinrc.thread_articles);

+ fprintf(fp, _(txt_thread_perc.tinrc));
+ fprintf(fp, "thread_perc=%d\n\n", tinrc.thread_perc);
+
fprintf(fp, _(txt_show_description.tinrc));
fprintf(fp, "show_description=%s\n\n", print_boolean(tinrc.show_description));

@@ -1683,6 +1689,7 @@
t_bool show_lines = FALSE;
t_bool show_score = FALSE;
t_bool thread_articles = FALSE;
+ t_bool thread_perc = FALSE;
t_bool use_builtin_inews = FALSE;
t_bool use_getart_limit = FALSE;
t_bool use_mailreader_i = FALSE;
@@ -1785,6 +1792,9 @@
if (thread_articles)
tinrc.thread_articles = THREAD_BOTH;

+ if (thread_perc)
+ tinrc.thread_perc = THREAD_PERC_DEFAULT;
+
if (use_builtin_inews)
strncpy(tinrc.inews_prog, INTERNAL_CMD, sizeof(tinrc.inews_prog) - 1);

diff -Nur tin-1.7.9/src/init.c tin-1.7.9.new/src/init.c
--- tin-1.7.9/src/init.c 2005-06-17 16:05:08.489045345 +0200
+++ tin-1.7.9.new/src/init.c 2005-06-20 12:00:34.828543703 +0200
@@ -291,6 +291,7 @@
SORT_THREADS_BY_SCORE_DESCEND, /* sort_threads_type */
BOGUS_SHOW, /* strip_bogus */
THREAD_BOTH, /* thread_articles */
+ THREAD_PERC_DEFAULT, /* thread_perc */
THREAD_SCORE_MAX, /* thread_score */
0, /* Default to wildmat, not regex */
-50, /* score_limit_kill */
diff -Nur tin-1.7.9/src/lang.c tin-1.7.9.new/src/lang.c
--- tin-1.7.9/src/lang.c 2005-06-07 23:59:45.000000000 +0200
+++ tin-1.7.9.new/src/lang.c 2005-06-20 11:46:21.581075549 +0200
@@ -1095,7 +1095,8 @@
N_("Subject"),
N_("References"),
N_("Both Subject and References"),
- N_("Multipart Subject")
+ N_("Multipart Subject"),
+ N_("Percentage Match")
};

/*
@@ -1415,7 +1416,21 @@
# 1 = Subject\n\
# 2 = References\n\
# * 3 = Both (Subject and References)\n\
-# 4 = Multipart Subject\n")
+# 4 = Multipart Subject\n\
+# 5 = Percentage Match\n")
+};
+
+struct opttxt txt_thread_perc = {
+ N_("Enter percentage match required to thread together. <CR> sets."),
+ N_("Thread percentage match"),
+ N_("# Thread percentage match...\n\
+# the percentage of characters in the subject of an article that must match a\n\
+# base article for both those articles to be considered to belong to the same\n\
+# thread. This option is an integer percentage, eg. 80, no decimals may follow.\n\
+# If 80 is used here, then 80%% of the characters must match exactly, no\n\
+# insertion of a character, for the two articles to be put in the same thread.\n\
+# eg. 'happy' and 'harpy' would match, but 'harpie', 'happie' and 'harppy'\n\
+# would be threaded separately from 'happy'\n")
};

struct opttxt txt_thread_score = {
diff -Nur tin-1.7.9/src/options_menu.c tin-1.7.9.new/src/options_menu.c
--- tin-1.7.9/src/options_menu.c 2005-06-08 17:16:03.591669000 +0200
+++ tin-1.7.9.new/src/options_menu.c 2005-06-20 12:06:27.541808833 +0200
@@ -1439,6 +1439,12 @@
redraw_screen(option);
break;

+ case OPT_THREAD_PERC:
+ prompt_option_num(option);
+ if (tinrc.thread_perc < 0 || tinrc.thread_perc > 100)
+ tinrc.thread_perc = THREAD_PERC_DEFAULT;
+ break;
+
case OPT_WRAP_COLUMN:
prompt_option_num(option);
/* recook if in an article is open */
diff -Nur tin-1.7.9/src/tincfg.tbl tin-1.7.9.new/src/tincfg.tbl
--- tin-1.7.9/src/tincfg.tbl 2005-06-07 23:59:43.000000000 +0200
+++ tin-1.7.9.new/src/tincfg.tbl 2005-06-20 11:38:12.935709163 +0200
@@ -43,6 +43,7 @@
inverse_okay OPT_ON_OFF
strip_blanks OPT_ON_OFF
thread_articles txt_threading THREAD_MAX+1
+ thread_perc OPT_NUM
sort_article_type txt_sort_a_type SORT_ARTICLES_BY_LINES_ASCEND+1
sort_threads_type txt_sort_t_type SORT_THREADS_BY_SCORE_ASCEND+1
pos_first_unread OPT_ON_OFF

--/9DWx/yDrRhgMJTb--