Algorithm outputs a series of repeated items but there are none in the training data
Hello,
I have noticed a behaviour that, to me, is a bit strange. I trained the algorithm with a series of sequences that had no repeated items, i.e. it’s not possible that an item appears again immediately after itself, like 1 in the sequence [3, 2, 1, 1, 5, 7, 2].
When I generated the most frequent sequences, though, I obtained repeated items. Is it possible?
For example, given the code:
import numpy as np
from prefixspan import PrefixSpan

seqs = [[22, 16],
        [22, 21],
        [22, 16, 14, 20],
        [22, 16],
        [22, 16, 34, 24, 26, 24, 26, 14, 13],
        [22, 16],
        [22, 26],
        [22, 13, 34],
        [22, 16],
        [22, 21, 16]]
ps = PrefixSpan(seqs)
ps.minlen = 2
ps.maxlen = 10
# minimum support: 10% of the sequences, rounded up (here: 1)
freq_ratio = 0.1
freq = np.ceil(freq_ratio * len(seqs)).astype(int)
res = ps.frequent(freq)
The output has [26, 26, 14, 13]
This is just a small reproducible example. In my case the dataset contains ~1000 sequences, but the problem remains.
Thanks
Issue Analytics
- Created 5 years ago
- Comments:5 (3 by maintainers)
Top GitHub Comments
Hi, you seem to misunderstand the concept of a pattern.
For example, for one of your provided sequences, [22, 1, 30, 1, 24, 30], the pattern [22, 30, 30] IS a sub-pattern of this sequence. It is allowed to have other items in between.

I have attached a file with some example sequences. It does not contain sequences with repeated items (i.e. where the same number appears once and then immediately again) but in the output I obtain, for example:
(156, [22, 30, 30])
Thanks for your help
Attached file: seqs.txt
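
To make the maintainer's point concrete, here is a minimal pure-Python sketch of the difference between a pattern that matches with gaps allowed (which is what the maintainer describes) and one that must match contiguously (which is what the question seems to expect). The helper names `is_subsequence`, `is_contiguous` and `support` are made up for this illustration and are not part of the PrefixSpan library.

```python
from typing import Callable, Sequence


def is_subsequence(pattern: Sequence[int], seq: Sequence[int]) -> bool:
    """True if the items of `pattern` occur in `seq` in order, gaps allowed."""
    it = iter(seq)
    # `item in it` consumes the iterator up to the first match, so each
    # pattern item has to be found after the previous one.
    return all(item in it for item in pattern)


def is_contiguous(pattern: Sequence[int], seq: Sequence[int]) -> bool:
    """True if `pattern` occurs in `seq` as an unbroken run (no gaps)."""
    n = len(pattern)
    return any(list(seq[i:i + n]) == list(pattern)
               for i in range(len(seq) - n + 1))


def support(pattern: Sequence[int],
            seqs: Sequence[Sequence[int]],
            match: Callable[[Sequence[int], Sequence[int]], bool] = is_subsequence) -> int:
    """Number of sequences in `seqs` that contain `pattern` under `match`."""
    return sum(match(pattern, s) for s in seqs)


# The sequences from the question.
seqs = [[22, 16],
        [22, 21],
        [22, 16, 14, 20],
        [22, 16],
        [22, 16, 34, 24, 26, 24, 26, 14, 13],
        [22, 16],
        [22, 26],
        [22, 13, 34],
        [22, 16],
        [22, 21, 16]]

# With gaps allowed, [26, 26, 14, 13] matches the fifth sequence
# (26 ... 26, 14, 13), even though no 26 directly follows another 26.
print(support([26, 26, 14, 13], seqs))                       # 1
print(support([26, 26, 14, 13], seqs, match=is_contiguous))  # 0

# The maintainer's example: [22, 30, 30] inside [22, 1, 30, 1, 24, 30].
print(is_subsequence([22, 30, 30], [22, 1, 30, 1, 24, 30]))  # True
print(is_contiguous([22, 30, 30], [22, 1, 30, 1, 24, 30]))   # False
```

If only contiguous patterns are wanted, one option under these assumptions is to post-filter the mined patterns with a check like `is_contiguous` against the original sequences.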