Algorithm outputs a series of repeated items but there are none in the training data

Hello,

I have noticed a behaviour that, to me, is a bit strange. I trained the algorithm with a series of sequences that had no repeated items, i.e. it’s not possible that an item appears again immediately after itself, like 1 in the sequence [3, 2, 1, 1, 5, 7, 2].

When I generated the most frequent sequences, though, I obtained repeated items. Is it possible?

For example, given the code:

    import numpy as np
    from prefixspan import PrefixSpan

    seqs = [[22, 16], [22, 21], [22, 16, 14, 20], [22, 16],
            [22, 16, 34, 24, 26, 24, 26, 14, 13], [22, 16],
            [22, 26], [22, 13, 34], [22, 16], [22, 21, 16]]

    ps = PrefixSpan(seqs)
    ps.minlen = 2
    ps.maxlen = 10

    # Minimum support: ceil(0.1 * 10) = 1 sequence
    freq_ratio = 0.1
    freq = np.ceil(freq_ratio * len(seqs)).astype(int)

    res = ps.frequent(freq)

The output contains the pattern [26, 26, 14, 13].

This is just a small reproducible example; my real dataset has ~1000 sequences, and the problem occurs there too.

Thanks

Issue Analytics

State:
Created 5 years ago
Comments:5 (3 by maintainers)

Top GitHub Comments

chuanconggao commented, Dec 6, 2018

Hi, you seem to misunderstand the concept of pattern.

For example, for one of your provided sequences, [22, 1, 30, 1, 24, 30], the pattern [22, 30, 30] IS a sub-pattern of this sequence. It is allowed to have other items in between.
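
To make the point above concrete, here is a minimal sketch in plain Python of the "in order, gaps allowed" matching rule. The helper name `is_subsequence` is hypothetical and is not part of the prefixspan library; it only illustrates why these patterns count as matches.

```python
def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order, with gaps allowed."""
    it = iter(sequence)
    # `item in it` consumes the iterator up to the first match,
    # so each later item must be found after the previous one.
    return all(item in it for item in pattern)


# The sequence from the comment above: 22, then 30 (index 2), then 30 (index 5).
print(is_subsequence([22, 30, 30], [22, 1, 30, 1, 24, 30]))

# The sequence from the original question: the two 26s are separated by a 24.
print(is_subsequence([26, 26, 14, 13], [22, 16, 34, 24, 26, 24, 26, 14, 13]))

# Order still matters: 22 never appears after 30.
print(is_subsequence([30, 22], [22, 1, 30, 1, 24, 30]))
```

The first two checks succeed even though neither sequence contains the same item twice in a row; the third fails because the items appear in the wrong order.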

ghost commented, Nov 26, 2018

I have attached a file with some example sequences. It does not contain sequences with repeated items (i.e. where the same number appears once and then immediately again) but in the output I obtain, for example:

(156, [22, 30, 30])

Thanks for your help

Attached file: seqs.txt
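
If patterns with the same item twice in a row are not wanted in the report, one possible workaround is to filter the result list after mining. The sketch below assumes the results are (support, pattern) tuples, as in the output quoted in this thread; the helper `has_adjacent_repeat` is hypothetical, and the sample `results` list is hand-counted from the 10-sequence example in the question rather than produced by running the library.

```python
def has_adjacent_repeat(pattern):
    """Return True if any item in `pattern` is immediately followed by itself."""
    return any(a == b for a, b in zip(pattern, pattern[1:]))


# Hand-written sample in the (support, pattern) shape, for illustration only.
results = [
    (7, [22, 16]),
    (1, [26, 26, 14, 13]),
    (1, [24, 26, 24, 26]),
]

# Keep only patterns with no item repeated back to back.
filtered = [(support, pattern) for support, pattern in results
            if not has_adjacent_repeat(pattern)]
print(filtered)
```

Note that this only hides such patterns from the output. The mining itself still matches with gaps, so it is not the same as mining contiguous patterns.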