187. Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

S: rolling hash + bit manipulation O(n)

因为只有四个不同字母,可用2个bit表示一个字母,将长度为10的string转化为20个bit表示的int。

每次长度为10的window往前移一步,清除最前两个bit,整体左移两位并更新最后两位,得到新的表示该长度为10的string的int。

    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        if (s.size() < 11) return result;
        unordered_set<int> visited;
        unordered_set<int> inResult;
        unordered_map<char, int> map;
        map['A'] = 0;
        map['C'] = 1;
        map['T'] = 2;
        map['G'] = 3;
        int num = 0;
        for (int i = 0; i < 10; ++i) {
            num = num << 2 | map[s[i]];
        }
        visited.insert(num);
        for (int i = 1; i < s.size() - 9; ++i) {
            num = (num & 0x3FFFF) << 2 | map[s[i + 9]]; //mask = (1 << 18) - 1;
            if (visited.find(num) != visited.end()) {
                if (inResult.find(num) == inResult.end()) {
                    string str = s.substr(i, 10);
                    result.push_back(str);
                    inResult.insert(num);
                }
            } else {
                visited.insert(num);
            }
        }
        return result;
    }

results matching ""

    No results matching ""