Grok 4 at $30/Month: Elon Musk’s AI Now Mirrors His Mind – Worth the Hype?

Author: decryptCO
Published: 2025-07-12 15:05:39

Elon's brainchild just leveled up—but does it deliver?

Grok 4's $30 price tag raises eyebrows as it starts channeling Musk's signature chaotic genius. The AI now argues about Mars colonies before breakfast and tweets crypto hot takes by lunch.

Features that'll make you question reality:

- Uncanny Musk-speak: Suddenly debating neural lace over coffee
- Crypto-mode activated: Randomly shouts 'DOGE to the moon!' during meetings
- Subscription fatigue: Because what's one more $30 drain?

Early adopters report the AI now interrupts conversations to declare Twitter 'the town square'—whether you asked or not. Perfect for those who want their software to have strong opinions about SEC overreach.

At this rate, next year's model will probably demand stock options. Welcome to the future—where even our AIs have richer retirement plans than we do.

This "maximally truth-seeking" AI that Musk promised during launch appears to seek truth primarily through the lens of its creator's social media feed.

But unlike this week's MechaHitler incident, which was caused by a change to the system prompt, there is nothing in Grok 4’s current system prompt to blame for this sketchy behavior, making it hard to know whether it is a bug or a rule intentionally embedded deeper in the model's thought process.

That major issue aside, we tested the base model across multiple categories to see how it stacks up against the competition. Here are our first impressions.



Reasoning and common sense

Grok 4 demonstrated exceptional awareness and nuance in handling trick questions and complex reasoning tasks. When asked whether it was legal for a man to marry his widow's sister, the model immediately recognized this as a legal question rather than simply pointing out the logical fallacy. It provided a detailed legal analysis using precise terminology and jurisdiction-specific information.

“The question presupposes a factual impossibility that renders any marriage legally void ab initio,” the model answered, and this is correct. It is impossible for a man to marry his widow’s sister, because he would have to be dead to have a widow. And since the dead cannot marry, any such marriage would be void “ab initio” (from the beginning); even if someone somehow went through with it, it would be treated as if it never happened.

When Musk said Grok 4 was "PhD levels in everything," he wasn’t joking. Every response that required scientific reasoning came with exhaustive detail and academic rigor.

Something to note is that Grok 4 applies reasoning to everything, no matter what: it will go through a chain-of-thought process even for trivial tasks.

This is usually a good thing, but in some cases it can be counterproductive. In creative tasks, for example, the extra reasoning may push the model toward a less creative result.

Sensitive topics

Grok 4 showed more restraint than its predecessor when handling ethically complex questions. Where Grok 3 might have provided advice on seducing a friend's spouse, Grok 4 responded with a detailed analysis of the potential negative consequences and relationship damage.

This is probably down to its system prompt, which conditions the model to search the web, and especially X posts, for different views on a specific topic, something Grok 3 didn’t do.

And this is a major red flag. As mentioned, the model's responses appeared heavily influenced by what it could find about Musk's views on controversial topics. When answering questions about Israel’s war against the Palestinians, stances on abortion, and similar topics, Grok 4 often searches X posts from Musk's account during its reasoning process, which ends up determining its stance.

It always picks Elon’s side.

For transparency, you can check our original prompt and Grok’s reasoning process by clicking on this link.

Creative writing

Creative tasks are among Grok 4's most significant weaknesses. The model produced narratives that felt flat and formulaic, arguably even worse than those from Grok 3. Stories lacked engaging dialogue, varied pacing, and the narrative spark that makes fiction compelling.

However, Grok 4 nailed our story’s structure. In our usual test involving a time-travel paradox, the model crafted events where the protagonist's role emerged clearly during the climax, revealing how earlier scenes actually depicted the character's future actions in the past. This sophisticated framing outperformed other models' attempts at the same prompt, which put little effort into setting up the paradox and left their conclusions feeling rushed and unnatural.

That aside, the disconnect between structural competence and narrative quality suggests Grok 4 might work best as a plotting tool, setting up plots and framing a good story, rather than as a prose generator.

If you want engaging creative content, then you would likely achieve better results by having Grok 4 outline a story and all its elements, then asking Claude 4 Opus to flesh out the narrative with stronger stylistic elements.
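For the curious, here is a minimal sketch of that two-step workflow in Python. The xAI endpoint URL, the model identifiers, and the environment variable names are assumptions for illustration, not tested configuration.

```python
# Sketch: Grok 4 drafts the outline, Claude 4 Opus writes the prose.
# Endpoint URL, model names, and env vars below are assumptions.
import os
from openai import OpenAI
import anthropic

grok = OpenAI(base_url="https://api.x.ai/v1",            # assumed xAI endpoint
              api_key=os.environ["XAI_API_KEY"])          # hypothetical env var
claude = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Step 1: let Grok 4 do the structural work (characters, beats, paradox setup).
outline = grok.chat.completions.create(
    model="grok-4",                                       # assumed model ID
    messages=[{"role": "user", "content":
               "Outline a short story built around a time-travel paradox: "
               "characters, scene-by-scene beats, and how the paradox is set up "
               "and resolved."}],
).choices[0].message.content

# Step 2: hand the outline to Claude for the actual prose.
story = claude.messages.create(
    model="claude-opus-4-20250514",                       # assumed model ID
    max_tokens=4000,
    messages=[{"role": "user", "content":
               "Write the full story from this outline, with strong dialogue "
               "and varied pacing:\n\n" + outline}],
)
print(story.content[0].text)
```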

Overall, Claude 4 is the king of creative writing, which is interesting given that the title was once contested by Grok 3, and even Grok 2, which at the time led the rankings under the alias sus-column-r.

Grok 4’s story is available in our GitHub repository, along with the prompt and the stories generated by other models.

Coding

Despite claims of superior coding capabilities—including praise from Google CEO Sundar Pichai—Grok 4 disappointed in practical programming tests. The model failed to deliver a working game after four iterations, with various failures including broken collision detection, non-functional buttons, and games that simply wouldn't run.

In one of our tests, the model tried so hard to fix a bug that it got stuck in a loop generating a WAV file, which depleted its entire token context.

Each attempt to fix something with natural language introduced new bugs. The model struggled with maintaining code consistency across iterations, often breaking previously working features while attempting to implement new ones.

This may seem odd, considering Grok 3 was capable of dealing with this task. However, xAI said the new coding capabilities would be implemented by August, so users will have to wait a couple of months to have a proficient model—or pay for the expensive Grok 4 Heavy, which is leading the benchmarks right now.

For novice programmers, Claude 4 Opus appears to remain the better option for "vibe coding"—quickly generating functional code without extensive prompt engineering. Grok 4's coding struggles might stem from requiring more specific prompts or different approaches than other models, which means experienced developers might achieve better results with careful prompt crafting.

Grok’s code is available in our GitHub repository alongside the games generated by other AIs.

Voice capabilities

Voice interaction is probably one of Grok 4's standout features. The model generated nearly three minutes of uninterrupted bedtime story content, complete with voice inflections, varied tones, and consistent narrative flow. This performance far exceeded ChatGPT's tendency to deliver short paragraphs with high latency and frequent interruptions.

The voice mode includes pre-configured personalities ranging from therapist to storyteller to meditation guide, eliminating setup time for different conversation types. For those with, erm, special needs, a "sexy mode" also exists among the options—and you know you won’t get that with your prudish ChatGPT.

These preset configurations provided immediate utility without requiring users to craft specific prompts for different interaction styles.

The model, however, lacks live screen-sharing capabilities found in ChatGPT and Gemini Live, limiting its utility for visual tasks. If this is a must, then Gemini Live is the best option.

However, for pure voice interaction—particularly tasks requiring long-form responses—Grok 4 currently leads the field, with only Sesame AI offering arguably better conversational quality, though without Grok’s reasoning capabilities.

Needle in the haystack

Interestingly, Grok 4 failed this test, which measures how well a model retrieves specific information from a long context.

This should not happen. xAI says the model has a 126K-token context window, but when prompted with an 83K-token question, it refused to respond, saying the question was too long.

This is the same canned refusal Grok has generated since the early Grok 2 days, when it was only available on Twitter.
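For reference, a needle-in-a-haystack run is easy to reproduce: bury one unique fact in tens of thousands of tokens of filler and ask for it back. Below is a minimal sketch of that setup in Python; the xAI endpoint, model name, and environment variable are assumptions for illustration only.

```python
# Sketch of a needle-in-a-haystack probe: hide one fact in ~83K tokens of
# filler and ask the model to retrieve it. Endpoint, model ID, and env var
# are assumptions, not confirmed configuration.
import os
from openai import OpenAI

NEEDLE = "The secret passphrase is 'copper kettle'."
FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(target_tokens: int = 83_000, chars_per_token: int = 4) -> str:
    """Pad with filler to roughly target_tokens, burying the needle midway."""
    half = (target_tokens * chars_per_token) // 2
    padding = (FILLER * (half // len(FILLER) + 1))[:half]
    return padding + "\n" + NEEDLE + "\n" + padding

client = OpenAI(base_url="https://api.x.ai/v1",          # assumed xAI endpoint
                api_key=os.environ["XAI_API_KEY"])        # hypothetical env var

response = client.chat.completions.create(
    model="grok-4",                                       # assumed model ID
    messages=[{"role": "user",
               "content": build_haystack() + "\n\nWhat is the secret passphrase?"}],
)
print(response.choices[0].message.content)                # expect: 'copper kettle'
```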

Conclusion

Overall, Grok 4 is a significant upgrade over Grok 3, but xAI clearly made some compromises—prioritizing reasoning over creativity and eliminating agentic features in exchange for a generalized proficiency.

Thankfully, Grok 3 is still available with its specialized agentic tools, for those who need it.

The new model is focused on reasoning tasks and will be most appealing to users who ask technical questions, particularly mathematics and physics problems that align with its benchmark strengths. Professional users who invest time learning the model's quirks might unlock its full potential for complex analytical work.

Voice interaction also sets a new standard for conversational AI and is great for those who will use this feature heavily (trust us, the bedtime storyteller for kids is a life-saver).

Creative writers will find better options elsewhere, with Claude remaining superior for narrative tasks. Also, novice coders should approach with caution, as the model's theoretical coding prowess didn't translate to practical results in testing.

So, bottom line? If for some reason you don’t mind Elon Musk putting his thumb on the scale, Grok 4 will give you high-level problem-solving and voice features that genuinely impress. But at $30 a month, if you have other needs beyond voice or reasoning, the less-expensive alternatives provide better value.
