Protecting user privacy with voice conversion

by COMPRISE

04 August 2021

Voice conversion

The idea behind anonymising a voice by “voice conversion” is to apply the voice characteristics of a different speaker to the original speech signal while keeping the spoken content understandable. This way, if an attacker gets hold of the speech signal, it will be difficult for them to re-identify the original speaker. The main advantage of voice conversion is that it does not try to remove the speaker’s voice characteristics, but rather imposes a different speaker’s characteristics on the signal. In other words, to maintain the balance between privacy and utility, it is easier to replicate another person’s voice than to neutralize the original one: “I want this voice” is an easier goal than “I do not want any voice”.

How do we evaluate voice conversion?

In order to evaluate the quality of the anonymization (i.e., how well the privacy of the speaker is protected), we evaluate how easy it is for an attacker to re-identify the original speaker. To re-identify the speaker of a transformed utterance, the attacker compares it with one so-called enrolment utterance from every possible speaker. This is done by computing the distance between the utterances using a trained distance model, which is a method similar to that used in the Voice Privacy Challenge.

To measure this, we use Top-k Precision, a metric that measures how often the true speaker is among the k speakers that the attacker finds most plausible. In addition, we consider different types of attackers, defined by their knowledge of the anonymization method and the effort they put into conducting the attack.

Ignorant: the attacker does not know about the anonymization method and uses un-transformed enrolment data.
Lazy-informed: the attacker anonymizes their enrolment data with the same anonymization method.
Semi-informed: in addition to anonymizing the enrolment data, the attacker re-trains the distance model with anonymized data.
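
The Top-k Precision metric described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each utterance is represented here as a plain vector of floats, and a Euclidean distance stands in for the trained distance model, which is not specified in this article. The function names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Stand-in for the trained distance model (an assumption of this sketch)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def rank_speakers(
    trial: Vector,
    enrolment: Dict[str, Vector],
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> List[str]:
    """Return speaker IDs ordered from most to least plausible (smallest distance first)."""
    return sorted(enrolment, key=lambda spk: distance(trial, enrolment[spk]))


def top_k_precision(
    trials: List[Tuple[str, Vector]],
    enrolment: Dict[str, Vector],
    k: int,
    distance: Callable[[Vector, Vector], float] = euclidean,
) -> float:
    """Fraction of trials whose true speaker is among the attacker's k best guesses."""
    hits = sum(
        true_spk in rank_speakers(vec, enrolment, distance)[:k]
        for true_spk, vec in trials
    )
    return hits / len(trials)


# Toy data: one enrolment vector per candidate speaker, and labelled trial utterances.
enrolment = {"alice": [0.0, 0.0], "bob": [1.0, 0.0], "carol": [0.0, 5.0]}
trials = [("alice", [0.1, 0.0]), ("bob", [0.2, 0.0]), ("carol", [0.0, 4.0])]

print(top_k_precision(trials, enrolment, k=1))  # bob's trial is closer to alice, so 2 of 3 hit
print(top_k_precision(trials, enrolment, k=2))  # every true speaker is within the top 2
```

In this sketch, the three attacker types would differ only in their inputs: an Ignorant attacker passes un-transformed enrolment vectors, a Lazy-informed attacker passes anonymized ones, and a Semi-informed attacker additionally swaps in a distance function re-trained on anonymized data.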

X-Vector Based Voice Conversion

Using the setup proposed in this work and illustrated in Figure 1, we extract two components from a given speech signal: 1) the content portion, in the form of pitch and bottleneck (BN) features, and 2) the speaker portion, in the form of an x-vector. The pitch captures how the utterance was spoken, and the BN features capture its verbal content. The x-vector embeds the speaker’s characteristics in a single vector (for example, x-vectors are used in speaker verification systems to distinguish utterances from different speakers).

The goal of this method is to re-synthesize a speech signal that keeps the same verbal content while enforcing a new target x-vector corresponding to another speaker. To do this, two modules are used to generate the speech signal: 1) a speech synthesis acoustic model that generates Mel-filter bank features given the pitch, the target x-vector, and the BN features, and 2) a neural source-filter (NSF) waveform model that produces a speech waveform given the pitch, the target x-vector, and the generated Mel-filter bank features.

Figure 1: X-vector based voice conversion technique.
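
The data flow between the two modules can be sketched as follows. This is not the actual implementation: the models are passed in as plain callables, and the names `SourceFeatures` and `anonymize` are hypothetical, introduced only to show which inputs each stage receives according to the description above.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]
Frames = List[Vector]


@dataclass
class SourceFeatures:
    """Features extracted from the original utterance (hypothetical container)."""
    pitch: List[float]   # how the utterance was spoken, one value per frame
    bottleneck: Frames   # verbal content (BN features), one vector per frame
    x_vector: Vector     # identity of the ORIGINAL speaker


def anonymize(
    features: SourceFeatures,
    target_x_vector: Vector,
    acoustic_model: Callable[[List[float], Vector, Frames], Frames],
    waveform_model: Callable[[List[float], Vector, Frames], List[float]],
) -> List[float]:
    """Re-synthesize the utterance with the target speaker's x-vector.

    The original x-vector (features.x_vector) is deliberately never passed
    to either model; only pitch and BN features carry over from the source.
    """
    # Stage 1: acoustic model -> Mel-filter bank features.
    mel = acoustic_model(features.pitch, target_x_vector, features.bottleneck)
    # Stage 2: NSF waveform model -> speech waveform.
    return waveform_model(features.pitch, target_x_vector, mel)
```

Any pair of trained models with matching call signatures could be plugged in; the sketch only fixes the order of the stages and the inputs each one sees.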

Once the x-vector has been extracted from the original speech signal, a replacement target x-vector must be selected from a pool of speakers. We experimented with multiple selection strategies, some of which take the original x-vector into consideration (e.g., selecting from the x-vectors farthest from or nearest to the original speaker), but the most successful strategies were independent of the original speaker. These are mainly: 1) the Random strategy, where we average N x-vectors selected uniformly at random from speakers of the same gender as the original speaker, and 2) the Dense strategy, where we identify clusters of x-vectors in the pool, rank them by density (i.e., number of members), randomly select one of the densest clusters, and average half of its members selected at random.
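
The Random and Dense strategies can be sketched as below. Several details are assumptions made for illustration: the clustering algorithm is not specified in this article, so cluster labels are taken as precomputed input; the number of "densest" clusters to choose from is a free parameter (`n_densest`); and "half" is rounded down with a minimum of one member. Function names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence

Vector = Sequence[float]


def average(vectors: List[Vector]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_random(same_gender_pool: List[Vector], n: int, rng: random.Random) -> List[float]:
    """Random strategy: average n x-vectors drawn uniformly, without replacement,
    from a pool already restricted to the original speaker's gender."""
    return average(rng.sample(same_gender_pool, n))


def select_dense(
    pool: List[Vector],
    labels: List[Hashable],  # assumed precomputed cluster label for each pool vector
    n_densest: int,          # assumed parameter: how many top clusters to choose from
    rng: random.Random,
) -> List[float]:
    """Dense strategy: pick one of the densest clusters at random, then
    average a random half of its members."""
    clusters = defaultdict(list)
    for vector, label in zip(pool, labels):
        clusters[label].append(vector)
    # Rank clusters by density, i.e. by number of members, largest first.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    cluster = rng.choice(ranked[:n_densest])
    half = max(1, len(cluster) // 2)
    return average(rng.sample(cluster, half))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pool = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [10.0, 10.0]]
print(select_random(pool, n=2, rng=rng))
print(select_dense(pool, labels=[0, 0, 0, 1], n_densest=1, rng=rng))
```

Neither function takes the original x-vector as input, which reflects the observation that the most successful strategies are independent of the original speaker (apart from the gender restriction on the pool in the Random strategy).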

Privacy evaluation

As expected, identifying a speaker’s voice becomes more difficult as the number of possible speakers increases. This holds for both original and anonymised data. However, identification of the original, non-anonymised voices remains well above chance, even among thousands of possible speakers. For the transformed, anonymised data, identification performance soon reaches chance level, or worse, as the number of speakers increases. Since exact re-identification proves more difficult for the transformed voices, we look at the probability that the true speaker is among the attacker’s 20 most plausible candidates (Top-20 Precision). As indicated in Figure 2, this probability drops drastically for the anonymised data compared to the original voice data (Baseline). This indicates that after anonymising the voice, it becomes much more difficult for an attacker to re-identify a speaker. Furthermore, Figure 2 indicates that even when the attacker is highly motivated and informed about the anonymisation method, hiding an anonymised voice among 52 other speakers is equivalent to hiding the original, non-anonymised voice among 20,500 other speakers.

Figure 2: Top-20 precision of speaker identification for different attackers as a function of the number of possible speakers. The numbers of speakers needed before anonymisation (N on blue curve) and after anonymisation (n on red curve) to achieve an equivalent drop in precision are highlighted.

What about utility?

To evaluate whether speech anonymised with the method described above preserves the spoken content and the overall diversity of the speech, we check whether the anonymised speech is usable for training an automatic speech recognition (ASR) system. For this evaluation, four cases were studied, depending on whether the data to decode and the training data are original or anonymized. As can be seen from the word error rate (WER, lower is better) in Figure 3, anonymized data can be used to train an ASR model capable of decoding anonymized data with an error rate comparable to the baseline.

Figure 3: Utility evaluation with ASR (O = Original, A = Anonymized; X-Y means decoding X using a model trained on Y).
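
For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the recognizer output into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance; it illustrates the metric itself and is not the evaluation tooling used for Figure 3. The function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words with an empty hypothesis costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("turn on the light", "turn on light"))  # one deletion out of four words
```

Because insertions are counted against the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.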

Conclusion

Speech anonymization based on x-vector voice conversion significantly lowers the re-identifiability of the speech signal, even when the attacker knows which voice conversion method was used. Furthermore, ASR models trained on anonymized speech show little difference from models trained on the original speech, indicating that the anonymized data preserves the spoken content. This can be seen as a win-win situation for privacy and utility.