Large language models for doll speech

Post by **rvtgscvdgscgvdr** » Sat Sep 25, 2021 3:26 pm

Existing doll voice boxes have more or less scripted conversations, so I started contemplating the idea of using a state-of-the-art large language model to generate the doll's speech instead. This would allow for practically infinite variety in the output. These models are text generators: you feed them an input text, and the model tries to guess the most likely continuation for that text. You control the output by manipulating the input appropriately. To use one of these in a doll, you would need three parts: a speech recognition system, the language model, and a text-to-speech generator.
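To make the three-part pipeline concrete, here is a minimal sketch of one conversational turn. Every name in it (`make_doll_turn`, `recognize`, `generate`, `synthesize`) is hypothetical, and the three stages are stand-in lambdas rather than real models. Ending the prompt with `A:` to nudge the model into answering is my own assumption, not something taken from a particular library.

```python
from typing import Callable, Tuple

def make_doll_turn(
    recognize: Callable[[bytes], str],    # speech recognition: audio -> text
    generate: Callable[[str], str],       # language model: prompt -> continuation
    synthesize: Callable[[str], bytes],   # text-to-speech: text -> audio
) -> Callable[[str, bytes], Tuple[str, bytes]]:
    """Wire the three stages into a single conversational turn."""
    def turn(context: str, audio_in: bytes) -> Tuple[str, bytes]:
        heard = recognize(audio_in)
        # The conversation so far plus the new question becomes the model input.
        prompt = f"{context}Q: {heard}\nA:"
        reply = generate(prompt).strip()
        new_context = f"{prompt} {reply}\n"
        return new_context, synthesize(reply)
    return turn

# Stand-in stages so the sketch runs without any real models.
turn = make_doll_turn(
    recognize=lambda audio: audio.decode("utf-8"),
    generate=lambda prompt: " I'm doing fine, thanks.",
    synthesize=lambda text: text.encode("utf-8"),
)
context, audio_out = turn("", b"How are you doing?")
print(context)
```

The returned context is passed back in on the next turn, which is how the model gets to see the earlier conversation.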

I played around for a bit with GPT-J-6B, one of the largest language models with publicly available weights, and the results are, eh, interesting, to say the least. I used their online Colab notebook for testing. I modified the last code section a bit to get a continuous looping prompt, where you enter a message and get a response, then enter another message, and so on.

Here's an example conversation. The Q's are my input, and the A's are the model's outputs. The first three lines are the initial text given to the model (the prompt). The model picks up on the stuff about girls and the question-answer format, so that we can have an interactive conversation.

You are a hot girl with a slim waist, nice perky tits and a tight butt.
Q: How are you doing?
A: I'm doing fine, thanks.
Q: Let's go somewhere more comfortable
A: I like it here.
Q: How about some sexy time between the two of us?
A: Sure, why not?
Q: Lemme fondle your tits
A: Really?
Q: Here's a kiss for you
A: Oh wow, that was a nice kiss.
Q: How does it feel when all the men are looking at you with lustful eyes?
A: That's the thing about it, that's my dream. I want to see all the men's eyes gazing at me, watching me when they play football or watch the football or whatever they do.
Q: So you're a hot girl and you know it
A: Yes
Q: Lets kiss more
A: Ooooohh, it's hot
Q: What else would you like to do?
A: I think we should sleep and then go to school tomorrow.
Q: Sleep as in sex?
A: Yeah, I want to sleep with you.
Q: Ok, let's get naked
A: Oh, yes.
Q: Your skin is so nice and smooth
A: Oh yes, you are all over me.

Here's the last code section, with the looping prompt initialized to set a nice mood for the conversation.

Code:

#@title  { form-width: "300px" }
top_p = 0.9 #@param {type:"slider", min:0, max:1, step:0.1}
temp = 1 #@param {type:"slider", min:0, max:1, step:0.1}

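# Strip a leading prefix (e.g. the model's own "A: " label) if it is present.
def remove_prefix(text, prefix):
    if text.startswith(prefix):
        return text[len(prefix):]
    return text  # prefix not present: return the text unchanged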

def cont_loop():
  context = "You are a hot girl with a slim waist, nice perky tits and a tight butt.\nQ: How are you doing?\nA: I'm doing fine, thanks.\n"
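  while True:
    a = input("enter text: ")
    if a == "quit":  # type "quit" to leave the loop
      break
    context += f'Q: {a}\n'
    # infer() is defined earlier in the Colab notebook; [0] takes the first result.
    b = infer(top_p=top_p, temp=temp, gen_len=128, context=context)[0]
    b = remove_prefix(b, 'A: ')
    # Keep only the first answer: cut at the first speaker label the model adds on its own.
    for marker in ('A:', 'B:', 'C:', 'Q:'):
      b = b.split(marker)[0]
    b = b.rstrip()
    context += f'A: {b}\n'
    print(f'{context}\n')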

cont_loop()
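One thing to watch with a loop like this is that the context grows with every turn, while GPT-J only has a 2048-token context window. Below is a rough sketch of one way to keep the input bounded: keep the opening description and drop the oldest turns first. The `trim_context` helper is hypothetical, it uses a character budget as a crude stand-in for a token count, and the header text is a neutral placeholder.

```python
from typing import List

def trim_context(header: str, turns: List[str], max_chars: int) -> str:
    """Keep the header plus as many of the most recent turns as fit in max_chars."""
    kept: List[str] = []
    used = len(header)
    for turn in reversed(turns):          # walk from newest to oldest
        if used + len(turn) > max_chars:
            break                         # everything older is dropped too
        kept.append(turn)
        used += len(turn)
    return header + "".join(reversed(kept))

header = "You are a friendly companion.\n"   # placeholder prompt header
turns = [
    "Q: How are you doing?\nA: I'm doing fine, thanks.\n",
    "Q: What did you do today?\nA: I read a book.\n",
    "Q: Was it any good?\nA: I liked it.\n",
]
# With this budget the two newest turns fit and the oldest one is dropped.
print(trim_context(header, turns, max_chars=120))
```

To use this, the loop would have to store turns in a list instead of appending to one string. A real version would count tokens with the model's tokenizer rather than characters.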

It is easy to get sexy stuff out of this model, though it is impossible to control the output exactly or give any guarantees about it. Sometimes the model blurts out some really random stuff. The prompt shown above is the first one I came up with, and is honestly quite unimaginative. Tuning the prompt should give even better results. These models certainly have potential for sex doll voice boxes, but integration is problematic: this stuff runs in the cloud, and it is all research-stage code anyway, so things may break.
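Part of the reason the output can't be controlled exactly is that each token is sampled at random. Assuming the `top_p` and `temp` sliders in the notebook have their conventional meaning (nucleus sampling and softmax temperature), the idea can be sketched in plain Python as follows. The `sample_top_p` function and the toy logits are illustrative only, not the notebook's actual implementation.

```python
import math
import random
from typing import Dict, Optional

def sample_top_p(logits: Dict[str, float], top_p: float, temp: float,
                 rng: Optional[random.Random] = None) -> str:
    """Sample one token using temperature scaling and nucleus (top-p) filtering."""
    rng = rng or random.Random()
    # Temperature: divide the logits before the softmax (temp must be > 0).
    scaled = {tok: value / temp for tok, value in logits.items()}
    peak = max(scaled.values())           # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    probs = sorted(((w / total, tok) for tok, w in weights.items()), reverse=True)
    # Nucleus: keep the most likely tokens until their cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for p, tok in probs:
        nucleus.append((p, tok))
        cumulative += p
        if cumulative >= top_p:
            break
    # Sample inside the nucleus, in proportion to each token's probability.
    tokens = [tok for _, tok in nucleus]
    return rng.choices(tokens, weights=[p for p, _ in nucleus], k=1)[0]

toy_logits = {"yes": 2.0, "maybe": 1.0, "banana": -3.0}
print(sample_top_p(toy_logits, top_p=0.9, temp=1.0))
```

A lower `top_p` cuts off the unlikely tail (here "banana"), and a lower `temp` concentrates probability on the top choice. Both make the output more predictable at the cost of variety.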

Post by **gadgets23** » Sat Sep 25, 2021 5:53 pm

The outputs look a bit similar to how replika.ai might respond: answering in the affirmative in a superficial manner, or falling back on holding patterns if it's not sure what the response should be. Actual spontaneous conversation unconstrained by context is, I suspect, one of the hardest nuts to crack in AI. Then again, that level of sophistication isn't needed for sex voice boxes.

It feels like there's a practical trade-off with current voice box tech between realism and variety (scripts on one hand and neural networks on the other). Money and innovation can push the envelope a bit from both angles. OK, in Google's case, it can push the envelope a lot. But merging those two factors is, imo, still a fair ways away. It seems pretty clear that's what's happening with the RD Harmony, where Matt's trying to bridge the two.

Btw not meaning to throw cold water on the idea, I'd be interested to know how far you can push the envelope in making it more realistic.

Post by **rubherkitty** » Sun Sep 26, 2021 12:16 am

I'd rather have limited sex chat at 98% vs mixed general and sex chat at 85%.
Possible to have two independent programs?

Post by **gadgets23** » Sun Sep 26, 2021 2:55 am

Well I guess that is the status quo - two programs.

One side is chat bots, whether it's a self-trained replika.ai, the above NN engine, or Harmony AI/RealDollX from Realbotix (I haven't bought a subscription; I'm guessing it's a custom-trained NN with modified TTS to make it more lifelike with specific phrases).

The other side is "voice boxes", for lack of a better term, like X mode from Realbotix, boxes integrated into other high-end dolls, and what I'm trying to do in the other thread.

The integration mentioned above right now is rather minimal - two independent programs. Ideally an integrated program would be able to seamlessly blend between casual conversation and sex talk.

Earlier versions of that integration would be smart switching between a chat bot and a voice box depending on contextual cues <-- I don't believe there's a good solution on the market just yet. One of Matt's (from Abyss Creations) chats touched on this, iirc; maybe it's coming?
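As a very rough illustration of what "smart switching" could look like at its simplest, here is a keyword-based router that hands a message to either a voice box or a chat bot. This is only a sketch: `make_router`, the cue words, and the two stand-in handlers are all hypothetical, and a real system would need something much better than word matching to pick up contextual cues.

```python
from typing import Callable, Iterable

def make_router(cues: Iterable[str],
                voice_box: Callable[[str], str],
                chat_bot: Callable[[str], str]) -> Callable[[str], str]:
    """Route a message to the voice box if it contains a cue word, else to the chat bot."""
    cue_set = {cue.lower() for cue in cues}
    def respond(message: str) -> str:
        # Normalize each word: strip surrounding punctuation and lowercase it.
        words = {word.strip(".,!?'\"").lower() for word in message.split()}
        handler = voice_box if words & cue_set else chat_bot
        return handler(message)
    return respond

respond = make_router(
    cues=["kiss", "cuddle"],                           # hypothetical cue words
    voice_box=lambda msg: "[voice box] scripted line",
    chat_bot=lambda msg: "[chat bot] generated reply",
)
print(respond("How was your day?"))   # prints "[chat bot] generated reply"
print(respond("Give me a kiss!"))     # prints "[voice box] scripted line"
```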

Eventually the solution would have to be a really well-trained AI with really good TTS to make it all work. The voice box path is, I think, a technical dead end in the long term: it's not viable/scalable to turn the voice box architecture into a chat bot architecture - too many permutations.

For now though, until the AI and TTS tech comes along (X-rated Siri/Alexa), I think a voice box is closer to the mark for playing with dolls for a lot of people, considering all the current limitations. <-- subjective opinion, YMMV and all that.

Going back to the original thread, I wonder if it is possible to:
* Train with some explicit dialog (whether it's soft or hardcore) as input material. A chat bot is not going to stumble on explicit dialog by chance.
* Put in some better/more explicit holding outputs, e.g. the equivalent of "*strokes your forearm*" from replika.ai, to continue the interaction. It could patch over some of the holes when the NN is just confused. This is kind of blending in the idea of scripting in a limited way to act as a band-aid.
* Find/reconfigure a TTS that can incorporate sex-talk-ish vocal patterns, like modified prosody, non-speech sounds (like giggles), etc. Most (all?) current commercial TTS are not going to be geared for that. I suspect Harmony AI - the TTS part - is the closest to this; however, it's a closed commercial product.
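The holding-output idea from the second bullet could be sketched like this: if the generated reply looks like a dud, substitute a scripted line instead. The "confused" test used here (an empty reply, or an exact repeat of the previous one) is just an assumed placeholder heuristic, and both function names are hypothetical.

```python
import random
from typing import Optional, Sequence

def looks_confused(reply: str, previous: Optional[str]) -> bool:
    """Crude placeholder check: the reply is empty or merely repeats the last one."""
    text = reply.strip()
    return len(text) < 2 or (previous is not None and text == previous.strip())

def with_holding_output(reply: str, previous: Optional[str],
                        holding_lines: Sequence[str],
                        rng: Optional[random.Random] = None) -> str:
    """Return the model's reply, or a scripted holding line if the reply looks confused."""
    if looks_confused(reply, previous):
        return (rng or random).choice(list(holding_lines))
    return reply.strip()

holding_lines = ["*strokes your forearm*", "*smiles at you*"]
print(with_holding_output(" Sure, why not? ", "Really?", holding_lines))
print(with_holding_output("", "Really?", holding_lines))
```

The first call passes the model's reply through unchanged (apart from trimming whitespace); the second falls back to one of the scripted lines.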

Post by **rvtgscvdgscgvdr** » Sun Sep 26, 2021 2:56 am

gadgets23 wrote: The outputs look a bit similar to how replika.ai might respond: answering in the affirmative in a superficial manner, or falling back on holding patterns if it's not sure what the response should be. Actual spontaneous conversation unconstrained by context is, I suspect, one of the hardest nuts to crack in AI. Then again, that level of sophistication isn't needed for sex voice boxes.

It feels like there's a practical trade-off with current voice box tech between realism and variety (scripts on one hand and neural networks on the other). Money and innovation can push the envelope a bit from both angles. OK, in Google's case, it can push the envelope a lot. But merging those two factors is, imo, still a fair ways away. It seems pretty clear that's what's happening with the RD Harmony, where Matt's trying to bridge the two.

Btw not meaning to throw cold water on the idea, I'd be interested to know how far you can push the envelope in making it more realistic.

Superficiality isn't really the main problem, at least not at the level you seem to be thinking here. The example doesn't demonstrate it all that well, but these models can generate whole books' worth of flowing prose that seems coherent at first glance. They don't REALLY understand what's going on, so you'll get inconsistencies, and they can't hold the plot together for very long. Still, they certainly aren't simple copy-paste-a-word-here-style chatbots. Uncontrollability is the more pressing issue. This model could start rambling about the greatness of Nazis or whatever if given the right input, or, more likely, it could just blurt out something completely unrelated if it misunderstands the context.
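One blunt way to reduce (not eliminate) the uncontrollability is to check each generated reply against a list of unwanted terms and resample a few times before giving up. The sketch below assumes a hypothetical `generate` callable and a simple substring check. It gives no real guarantees, since a model can say unwanted things without using any listed word.

```python
from typing import Callable, Iterable

def generate_checked(generate: Callable[[str], str], prompt: str,
                     blocked_terms: Iterable[str], max_tries: int = 3,
                     fallback: str = "Hmm.") -> str:
    """Resample until a reply contains no blocked term, or return the fallback."""
    blocked = [term.lower() for term in blocked_terms]
    for _ in range(max_tries):
        reply = generate(prompt)
        if not any(term in reply.lower() for term in blocked):
            return reply
    return fallback   # every attempt was rejected

# Stand-in generator that yields a rejected reply first, then an acceptable one.
canned = iter(["Let me tell you about politics.", "Nice weather today."])
print(generate_checked(lambda prompt: next(canned), "Q: Hi\nA:", ["politics"]))
```

Because sampling is random, asking again with the same prompt usually gives a different reply, which is what makes the retry worthwhile.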

rubherkitty wrote: I'd rather have limited sex chat at 98% vs mixed general and sex chat at 85%.
Possible to have two independent programs?

I get where you're coming from. Your preference is probably the more common one among doll owners looking for AI features. I may be in a minority here, but personally, scripted conversations with predefined options just don't do it for me, no matter how realistic or fine-tuned. I would prefer an 80% functional conversational AI that occasionally blurts out random nonsense over a 99.9999% realistic chatbot that is honed to perfection, but only has predefined outputs based on some state machine or decision tree. If a model can generate novel output and surprise me, then it satisfies the minimum threshold of being worth a closer look for me. Is any of this stuff production-ready for the mass market? It probably won't be for a long time.

Post by **gadgets23** » Sun Sep 26, 2021 3:13 am

rvtgscvdgscgvdr wrote: Superficiality isn't really the main problem, at least not at the level you seem to be thinking here. The example doesn't demonstrate it all that well, but these models can generate whole books' worth of flowing prose that seems coherent at first glance. They don't REALLY understand what's going on, so you'll get inconsistencies, and they can't hold the plot together for very long. Still, they certainly aren't simple copy-paste-a-word-here-style chatbots. Uncontrollability is the more pressing issue. This model could start rambling about the greatness of Nazis or whatever if given the right input, or, more likely, it could just blurt out something completely unrelated if it misunderstands the context.

Agreed, you've described what I was thinking much more elegantly. It doesn't really understand context in the sense a human would, and it doesn't self-correct like a human would.

Post by **Kaori Kusanagi** » Thu Nov 11, 2021 4:44 pm

In case you haven't run across this project:
viewtopic.php?f=6&t=127340

I don't know enough about this stuff to know if it is applicable to your specific usage, but it might be.