187. Repeated DNA Sequences
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",
Return:
["AAAAACCCCC", "CCCCCAAAAA"].
S: rolling hash + bit manipulation O(n)
因为只有四个不同字母,可用2个bit表示一个字母,将长度为10的string转化为20个bit表示的int。
每次长度为10的window往前移一步,清除最前两个bit,整体左移两位并更新最后两位,得到新的表示该长度为10的string的int。
vector<string> findRepeatedDnaSequences(string s) {
vector<string> result;
if (s.size() < 11) return result;
unordered_set<int> visited;
unordered_set<int> inResult;
unordered_map<char, int> map;
map['A'] = 0;
map['C'] = 1;
map['T'] = 2;
map['G'] = 3;
int num = 0;
for (int i = 0; i < 10; ++i) {
num = num << 2 | map[s[i]];
}
visited.insert(num);
for (int i = 1; i < s.size() - 9; ++i) {
num = (num & 0x3FFFF) << 2 | map[s[i + 9]]; //mask = (1 << 18) - 1;
if (visited.find(num) != visited.end()) {
if (inResult.find(num) == inResult.end()) {
string str = s.substr(i, 10);
result.push_back(str);
inResult.insert(num);
}
} else {
visited.insert(num);
}
}
return result;
}