Task

I’m working on my final project for school, we are supposed to make a web app of our choosing and there has to be specific features in it. One of it is all data must be encrypted, and the other is that we have to have a search functionality. My app (A customer support framework) has a ticket functionality where customers can submit help request tickets, the contents of these tickets need to be encrypted at rest, at the same time admins need to be able to search contents of tickets.

Current Plan

My current plan is to store an AES-256 encrypted copy of the message message.content to meet the encrypted requirement, and also store a tokenized and hashed version of the message message.hashed to meet the searchability requirement.

The tokenization/hashing method will be:

  • strip the message to alphanumeric + whitespace ([a-zA-Z0-9 ])
  • tokenize by splitting the message by whitespace,
  • SHA-256 each token,
  • rejoin all the hashed tokens into a space seperated string and stored in the message.hashed field.

Thus this is a test string becomes <hash of this> <hash of is> <hash of a> <hash of test> <hash of string>

When the user searches their search string goes through all of the steps in the tokenization/hashing method, then we query the message table for message.hashed LIKE %%<hashed string>%% and if my thinking is right, we should be able to find it.

Concerns

  • Statistical analysis of hashed tokens
    • I really don’t see a way around this, to make the string searchable the hashing needs to be predictable.
  • message.hashed field could potentially be huge, if each word is getting a SHA256 hash, a large message could result in a very large hash string
    • maybe we just store the last 4 of the hash?
      • This would increase collisions, but the likelihood of multiple last 4’s colliding in a given search string should be pretty dang small, and any collisions would likely not be valid language.
      • Would this help with the statistical analysis concern? Increasing collisions would decrease the effectiveness of statistical analysis. It would be a performance hit, but after returning all matches against the hashes I could decrypt the message.content data and search the raw search query against the unencrypted text and remove any incorrect returns caused by collisions.

I’m interested in hearing everyone’s thoughts, am I being logical in my reasoning?

  • litchralee@sh.itjust.works
    link
    fedilink
    English
    arrow-up
    3
    ·
    edit-2
    1 day ago

    The other commenters correctly opined that encryption at rest should mean you could avoid encryption in memory.

    But I wanted to expand on this:

    I really don’t see a way around this, to make the string searchable the hashing needs to be predictable.

    I mean, there are probabilistic data structures, where something like a Bloom filter will produce one of two answers: definitely in the set, or possibly in the set. In the context of search tokens, if you had a Bloom filter, you could quickly assess if a message does not contain a search keyword, or if it might contain the keyword.

    A suitably sized Bloom filter – possibly different lengths based on the associated message size – would provide search coverage for that message, at least until you have to actually access and decrypt the message to fully search it. But it’s certainly a valid technique to get a quick, cursory result.

    Though I think perhaps just having the messages in memory unencrypted would be easier, so long as that’s not part of the attack space.