What It Actually Takes to Build an AI Company [Ft. Francois Chaubard, Y Combinator]
Academics make terrible founders. YC does not fund ideas, generally. Even I had this experience: out of thirty-one companies we funded, maybe three or four pivoted before they even got to the batch. LLMs will be better than us on specific things, and they already are. But for fluid intelligence, we are still far superior. Given the advent of Claude Code and vibe coding, the activation energy to start a company is way down. So back to "the something of somewhere is the nothing of nowhere." If you were going to start a company, what would you focus on as a mentor? What would you advise? The first thing is: do a company. One skill you need to be an entrepreneur is not to be Einstein. I think Einstein would be a very bad entrepreneur. Why are Steve, Bill Gates, and Jensen so good? Because they've failed so many times. There is no shortcut. What advice do you have for folks who feel YC isn't accessible to them? Two things.
Today I'm with Francois Chaubard, a visiting partner at YC, and I'm Chad Lohrli, founder of Cadre AI and SDx. We're in Woodside, California, in the offshoot of my house where I get away from the crying babies, find some solace, and get work done. First time here. We drove up through Stanford, where you're doing your PhD. We've met a few times, and had you in San Diego for a YC/SDx event. Let's get a bit of your background: New York, Delaware, Stanford, started a company, now back at Stanford. Tell me the trajectory. I grew up in upstate New York, a town called Highland Falls, right outside West Point. Went to University of Delaware. I was the first person in my immediate family to go to college, and it wasn't obvious I was even going to go. Did math and mechanical engineering, then graduated. Got what I thought was the coolest job of all time at Lockheed Martin, working on ballistic missile defense: watching missiles hit other missiles two miles above the Earth's crust going thousands of miles an hour.
Then I quickly realized working for the government as a defense contractor is probably the most boring job. Very unionized. People work forty hours a week and yell at you if you work more. I genuinely wanted to work a hundred hours a week. After about two or three years in Lockheed's engineering leadership development program, they force you to apply to grad school. I applied, got into Stanford. To my shock — I had no idea that was going to happen. Was that where you wanted to go? I thought it was worth a shot, a free call option. Did well on standardized tests. To show how little my immediate circle knew about Stanford: I told my mom I got in and she said, "Is that in Connecticut?" That's not the community we grew up in — education wasn't a big piece of my upbringing. I got to Stanford in fall of 2012 — the best time to start in Fei-Fei Li's computer vision lab. The AlexNet paper got published. Off to the races, been doing deep learning and backprop ever since. I've got a similar but different background. I spent time at JPL and came to the same realization — working in government, too many limiters. Instead of Stanford, I started doing stuff in crypto, then made my way back to ML. My only Stanford experience was CS231n — one of the most pivotal online courses I took, one of my entry points into ML. At the time Andrej was doing CS231n, I was doing CS231M with Silvio, Fei-Fei's husband. I respect Stanford for this — every other major university has a long lag from when things happen in the real or academic world to when they make it into the curriculum. To Stanford's credit, Fei-Fei let Justin and Andrej — Justin doesn't get credit, he co-taught and co-created it. Andrej went to Fei-Fei and said, this needs to be a class. We were still teaching bag-of-visual-words, HOG and SIFT descriptors. We need to not be doing that — it's completely different now. Fei-Fei saw it quickly. 
I still remember serious professors at Stanford saying deep learning might work for computer vision but won't work for anything else. Then Richard Socher, a PhD while I was creating CS231M, wanted to do the same kind of class for deep learning for NLP. Even Chris Manning said, I'm not sure it's worth a full dedicated class — I agree we should teach it, just not the only one to teach it. Richard got the nod, asked me to co-create it with a few others, and we made CS224D. Two years later they gutted all the bag-of-words and CBOW content and replaced it with ours. The main NLP class at Stanford still has my name on it. The cool thing I observed was how quickly Stanford professors gutted the curricula to apply the latest. I don't know how San Diego is, but at University of Delaware I'd be shocked if they're teaching transformers yet. That's the difference in education — you're getting it so much quicker to the bleeding edge. I think it's a problem. I went to UC San Diego. My girlfriend at the time was hesitant because their engineering program wasn't accredited — for some engineers you need a professional engineering degree. She did mechanical. I didn't care, I was doing CS, give me the bleeding edge. But it takes time to roll courses out, and by the time you graduate you're far behind. With AI now, if you're not at the cutting edge or even using LLMs, you're knee-capping yourself. Is Stanford continuing to be on the cutting edge — same rate, faster? They're maintaining it by accident. There's no overt effort to figure out the bleeding edge. It's twofold. One, industry leaders want to teach at Stanford — that "Coachella lineup" frontier systems class everyone saw on Twitter, with Satya, Sam, and others. Two, every single Stanford CS professor has a company that's raised over $300M. Every single one — no exception. In 2012 it was an extreme anomaly: Daphne, Andrew, and Sebastian were the only ones with companies. Fei-Fei didn't. Silvio didn't. 
Jure Leskovec just got there. Now it's fully a thing. There's a pro and a con. The con: a reverse classroom, where they show up and just ask questions. I went to Jure's CS229G for graph neural networks. He showed up to the first lecture, didn't show after that, and sent grad students. Not good. But the content is great because it's state of the art, and he's actually deploying it, enough to have a company that can raise. James Zou just raised at like a billion-dollar valuation doing autonomous research. Is he a professor? Yeah, and an adviser. Chris Ré is very entrepreneurial. Chelsea Finn has Physical Intelligence. The con is they get less involved because they're focused on startups. Running a startup is not a part-time job; being a professor could be, but probably should be a full-time job. On the margin, give me the bleeding edge. I want my PI on the cutting edge. You talked in one of your blogs about the academic founder versus the entrepreneurial founder. These professors are researchers by nature, but they're also running companies, so they use both parts of the mind, where for most people it's usually one or the other. Coming from doing deep research in Stanford AI Lab, then ten years of entrepreneurship, now back to research, I didn't realize the difference in mentality until now. I really understand why academics make terrible founders. Joe Lonsdale talks about this: MIT-style startups vs Stanford-style. MIT-style: we did all this research, invented this new tech, got it patented. Look at this magic black box, isn't it cool? How does it work? Can't tell you. Okay, don't you want it? No. That's a terrible startup. It never works. Google is maybe the only example where a better algorithm actually outperformed, and that was great, but it's not sustained differentiation. Good luck patenting since Alice: the Supreme Court basically ruled all software is business methods, which aren't patentable, enforceable, or detectable. It's not a defense anymore.
Stanford-style: I'm obsessed with the problem. I want to get people their groceries wherever they are. Part of that has technology, maybe getting 1099s in cars to drive your groceries. That's what Apoorva and Max Mullen did at Instacart. What was the black box there? None. Just obsessed with the problem. Apply technology, business process, operations excellence, whatever it is, by hook or by crook, and solve that problem better every day. Airbnb, Brian Chesky: what's the black box? Obama O's? Just obsessed with the problem and pulling every lever. That's the best way to build startups. Your customers don't give one iota whether it's an algorithm enhancement, better software, or better hardware. That's the main reason academics are poor at this. To publish, you have to have novelty; you can't copy. In startups, copying is recommended; pioneers get arrows in the back. What did Instacart do that was novel? Webvan came around way before them and invented a lot. Then DoorDash after Instacart, Rappi after DoorDash and Instacart. We were in the same YC batch as Rappi, Winter 2016. Simon got on stage and said we're the Instacart of LatAm. I said, why isn't Instacart the Instacart of LatAm? Turns out Instacart can't be, because they don't understand LatAm. It requires someone from Colombia. We have quite a few Colombian team members, and that's when I first learned about Rappi. Visited Colombia, they said yeah we use Rappi. I said where's Instacart? PG said it: get as many touchpoints with reality as you can. That's the YC ethos. Let's talk about what you're doing at YC. How long have you been there, and what trends are you seeing as we've gone from ChatGPT to wrappers, agents, to what's next? After nine years in the cockpit at Focal (about seventy years in startup life), I wanted to get to PG's Default Alive, then we signed a $60M deal. After that I was done. We were Default Alive, I hit my goal, took a step back. My options: start another company, which I had a lot of pressure to do.
Initial Focal investors said let me know when you have the next one, I'll write your term sheet now. Had ideas, had employees saying, I know you're leaving to start another company, I'm down to leave. My wife said, you have to take at least a year. So what do I do for a year? I wanted to get back deep into research. I'd forgotten how a screen session worked. How does tmux work? Docker compose? Forgot it all. I wanted to get back technical. I left deep learning hands-on when training on one 1080 or V100 was a big deal. Now you have multiple GPUs, nodes, data centers. I wanted to learn torch distributed and the like. Went back to research. My professors said, you should apply back in. I did, got into the CS PhD program, been doing research. I caught up with Jared Friedman, my partner at YC, showed him my ARC AGI results — loss curve going down. On Fridays I did investments — invested in a couple of YC startups, did demo days. He said, Francois, we're trying to get deeper into AI, you seem to want to help startups, why don't you do that at YC? I didn't know it was an option — it'd be a dream job. I put together a slide deck for Garry on ideas to nudge YC deeper into AI. He said: accept all changes. And now you're using GStack, right? Exactly. That was about eight months ago. I've been at YC since October last year. Did my first batch with Jared. We focus more on the AI stack and what's coming. We have our points of view. It's been a dream — fun to help. I love YC. We give opportunities to folks no one else would have funded. If PG didn't fund Airbnb, Brian Chesky literally said they were going to go back to Rhode Island and design PowerPoints. We wouldn't have Airbnb. So many examples like that. And generally, social mobility. You talk a lot about how YC has increased social mobility. What other company has done more for social mobility? I would have said that before joining back in, but watching it firsthand — take Simon at Rappi. 
He created a multi-billion-dollar company from an idea he had living in Colombia. If he could have funded that elsewhere, he would have. It enables these companies to pop up and contribute meaningfully to job growth and GDP. That's what's required. Curious about Simon's story. How did he hear about YC? He could have gotten funding in Colombia. My co-founder was Bolivian, they were close, we'd grab dinners. He had very little investment. It wasn't his first company — did two or three before. Applied to YC three or four times. He had little bits of funding to keep going — $20K, $50K here and there. Then he got into YC. Thought, I finally made it. Got on stage. Did not fundraise well. It wasn't, here's $13M from Andreessen — it was crickets, for the same Peter Thielism: the something of somewhere is the nothing of nowhere. Everyone thought the Instacart of Colombia should be Instacart. Turns out no. Then Rappi exploded. Simon is a great entrepreneur. He finally got Sequoia to invest meaningfully — I think Andrew did the investment — they took off. But Sequoia wouldn't have been on him if he was based in Cartagena building from Medellín. No way it would have been on Sequoia's radar. We have a presence in Colombia — many AI tinkerers there, a lot of great ML talent. Andrew Ng taught down there with Landing AI. What advice do you have for folks who feel YC isn't accessible because they're in South America or elsewhere? We're actively trying to get more of the world involved with YC. Just in the last two weeks Jared, Ankit, and Harshita went to India and did a 2,000-person India startup school. Trying to get more entrepreneurship participation worldwide. We get a lot of applications — about two-thirds international now. We fund people from all over. Two things. One, people don't even know — like my mom not knowing where Stanford was, my parents not knowing who Warren Buffett was. Just knowing about YC is step one. Knowing that's a path. That's what we want to promote. 
You don't get there unless you show up — that's why we came to San Diego, why we're doing college tours. It's amazing how few people think, oh, I could start a company. They think, I should go work at Google and be another cog, or worse, Lockheed Martin. Two, given the advent of Claude Code and vibe coding, the activation energy to start a company is way down. Back to the something of somewhere is the nothing of nowhere. Why couldn't Instacart be the Instacart of LatAm? So much bespoke knowledge — how drivers think, how customers think, how to price, what matters for mopeds. They use mopeds with massive orange backpacks branded with the Rappi mustache. Unless you grew up there, you wouldn't know that's the way. There's geographic segmentation of Uber — Middle Eastern Uber, New York Uber, many Ubers. Same with delivery. In New York we've had Seamless forever. Geographic distribution allows more fragmentation because anyone can start with local knowledge and do epsilon better locally, with much less startup capital. Local dynamics give better product-market fit. It might make sense to build a better search engine or a better Yelp for every country. That was impossible because the software engineering and branding for Yelp was too much — activation energy didn't make sense. But now it's so much lower, you can make a Yelp in Bulgaria bespoke to the Bulgarian community, which I know nothing about. That's my rant on international startups. Now is the best time, and first-mover advantage will matter a lot. So take an idea that works in another country, and because activation energy to build it with Claude Code or Codex is so low, just build it, go to market, use your local knowledge. Novelty isn't important in startups — it's actually a con. It's unproven, and even if you prove it, you got war-torn trying many things, you've raised so much money your cost of capital is higher. Just copy what works in America. 
Rocket Internet in Germany copies all American companies and does it in Germany. I'm not saying it's a good thing, but it's a great business. Back to the idea we started on: academics suck at this, they wouldn't do it. They want to be novel. When you do research you want to figure out something new about the world. Taking an existing idea and retrofitting it for a new culture feels not novel. That's the copying. Even for novelty ideas — I do this dinner table prompt. My wife hates it but it's fun. We did a surf trip in Sayulita. Before appetizers, I forced our friends to whip out Lovable and come up with a startup idea built by the end of dinner, then vote. One girl is super deep into astrology — Pisces rising, that level. She used the right terms in Lovable and it built this serious math equation to score your compatibility with your partner. Every word would lead to three divorces or something. It had reasons — here are the friction points in your marriage — and it was weirdly really true. Would people pay $1 a month or $10 once for that? Might be a real business. She built it in an hour with Lovable. Are there a bunch of non-venture-scale businesses that pop up — lifestyle businesses making $3M a year? I'll take $3M a year. What's your COGS? Zero. What's changed in YC's investment thesis now that the bar to build is so low? Is the bar higher on what you build, or is distribution more important? Still betting on founders? Has anything fundamentally changed? Garry is definitely AI-pilled. We're probably the most AI-pilled of the venture community. Every single partner is very technical, an ex-founder, an ex-YC founder. A unique group. Everyone has their Claude — like, open Claude — everyone's after you all wore the — I thought it was funny. That's why I love YC. So I think the aperture of what's fundable is wider. YC does not fund ideas, generally. Even I had this experience — out of 31 companies we funded, maybe three or four pivoted before they got to the batch. 
We fund them in December, January 6th they start — in a month they've already pivoted. Some already had revenue on the new idea, like Odd Voice did. An insane story. What we've always funded, even more true now, is what Garry calls agency and taste. Are you a person who says, well, we tried — or, I tried 600 things? Odd Voice is a great example. Two Thai brothers, voice agent company. Funded December 15th. Show up January 6th, office hours, they say we pivoted to Harvey for architecture. Okay, interesting — know any architects? Yeah, we already have six architects paying us a thousand a month. What? That happened three weeks ago. That's insane agency. They pivoted, built it, got customers, got them to pay. That's the agency YC loves. So many people, especially academics, sit in their corner, code code code, keep working on some benchmark, never release. That's the antithesis of YC's ethos. Cursor did a podcast I just loved. They shipped their initial version four months after launch — big jump, then the trough of disillusionment. Everyone turned off Cursor. But they were building it themselves to dogfood and try it themselves, which is so important. Most of the time you can dogfood, even building software for grocery stores — we built a grocery store in our office to dogfood. The argument that you can't dogfood because it's a new vertical is wrong — you almost always can. Because they were dogfooding, they really built it out. They wondered if they should pivot to bug-fixing only, like a Sentry bot. They kept working on what they wanted until it got good. Most people think they're too smart for that — I can skip it, I know what's good, I'll build it myself because I'm Steve Jobs. No, you're not. Steve Jobs is Steve Jobs because he failed three thousand times by launching it out. He built a mental model of what'll stick. You haven't. He wasn't that good at 21 — he shipped a lot of products, the Newton and others, that completely flopped. 
The last thing on Cursor, within YC ethos: for every ten things we launched, one was going to work. Back to the PM stuff — PMs at big slow organizations get fired if they launch nine products that all failed. But that's exactly what you need to do. I would fire you if you didn't launch ten things in a month, and if nine failed and I looked at them, I'd say, yeah, those were pretty good ideas, I would have thought they'd work. That's how bad a predictor people are. One out of ten works, you just keep chasing that. With Cursor, you could tell it was being used by the team. Read Aman's tweets — super technical, posts long threads on how they pivoted from semantic indexing of the code base to glob and grep, and why that's better. You can feel the founders' passion. They were using it, getting it, and that fueled the growth. Back to academic versus founder: I'm obsessed with making coding easier, I'll work on that for a hundred years. It's never going to be solved. Versus, we have a new patent on indexing code, we'll commercialize it. No one cares. The biggest innovation in Cursor — now worth $60B — was the user interface, making the UX the way it was. That was the main innovation I didn't see before Cursor. Sourcegraph existed — we knew how to index code bases, that company's been around ten years. Autocomplete on tab has been in IDEs forever. Cursor tab wasn't that innovative. Showing the user experience, but it had to be fast — not just good, fast. Had to adapt. Push the next thing, the next thing. That makes great companies, not look-at-my-black-box. People freaked out when the UX changed frequently on Cursor — oh my god, I'm used to this, now it's moved. They listened. That takes courage too — changing a design pattern that's working takes product courage. One thing I love about Boris on Claude Code is he's constantly thinking about cutting product, cutting slash commands. He says, I think we're going to cut that one. But it has usage. Yeah, but not enough. 
Did you get a Rare buddy? I didn't get a buddy. I tried, it didn't work. What do you use day to day? Claude Code pretty much exclusively, with Gemini Pro as my last step before it comes back to me. I have a big fork of Karpathy's Auto-Research — I call it Auto-Research With Good Ideas. The way his comes up with the next idea when you do a new branch is, just come up with another idea. That's not how research works. Every time you finish an experiment you have zero to N new ideas. You should append that to a queue, rank-sorted by EV — expected value and return. So I have a separate ideas.md I'm constantly running a priority queue over. At the end you run a reflection period — what did I just learn? The Gemini part helps with that reflection, chain-of-thought on your prioritizing, look through the results. Then a bunch of one-off scripts you might run — here's a thing that happened, another, another. What's the rule? Oh shit, I found the rule. That's research. I figured out something not before known. That's what I'm trying to get the models to do. To Garry's credit with GStack, there's a lot of juice to squeeze with a harness — imbuing the human processes of how we do research. James Zou is leading that charge at Stanford. Chris Ré is doing a lot. I've been toying with it. We're going to get a lot of scientific research done by imbuing scientific research processes into a harness. Something I've been thinking about: given a harness, an LLM, a procedural SOP — steps to accomplish a goal — and perfect context or access to tools and APIs, what's insoluble in software with a harness and an agent? There are classes of intelligence — in complexity, automata theory, we have nicely formed classes P, NP, NP-hard. We don't have that in intelligence. We're bounded by the interpolation of human knowledge. A lot of innovations came from going cross-discipline — merging a neuroscientist and a computer scientist. There's new knowledge to be created by combining two big spikes. 
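The ideas queue described above (append each experiment's zero-to-N follow-up ideas, then always pull the highest expected-value one next) can be sketched roughly as follows. This is a minimal illustration, not the actual fork: the `IdeaQueue` class, the `finish_experiment` helper, the idea strings, and the EV numbers are all hypothetical, and it assumes EV has already been scored as a single number per idea.

```python
import heapq
import itertools


class IdeaQueue:
    """Priority queue of research ideas, highest expected value (EV) first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: earlier ideas win ties

    def add(self, idea, expected_value):
        # heapq is a min-heap, so store the negated EV to pop the largest first.
        heapq.heappush(self._heap, (-expected_value, next(self._counter), idea))

    def pop_best(self):
        neg_ev, _, idea = heapq.heappop(self._heap)
        return idea, -neg_ev

    def __len__(self):
        return len(self._heap)


def finish_experiment(queue, new_ideas):
    """After an experiment, append its zero-to-N follow-up ideas to the queue."""
    for idea, ev in new_ideas:
        queue.add(idea, ev)


# Hypothetical ideas and EV scores, for illustration only.
queue = IdeaQueue()
finish_experiment(queue, [("try larger batch size", 0.2), ("ablate the adapter", 0.7)])
finish_experiment(queue, [])  # an experiment may produce no new ideas
finish_experiment(queue, [("sweep learning rate", 0.5)])
next_idea, ev = queue.pop_best()
```

A reflection step would re-score the remaining entries after each result; here the scores are fixed when an idea is added, which keeps the sketch short.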
I think about knowledge as a spiky ball. Innovation comes from sampling both and finding a linear interpolation between them — forming a beach ball. Getting outside that beach ball, I don't think you can do — unless you figure out how to get these models to think more deeply. François Chollet talks a lot about the fluid intelligence of these models — basically there's none. There's no fluid intelligence in these models. When you say outside the ball, you mean outside the training distribution or the samples? Thinking in terms of train/test distribution or out-of-distribution is usually unhelpful — it's the full world, all of Common Crawl, what's outside? It's hard to think about, but it's evident when you give it a new task how quickly it adapts. I've been trying to harness my way to win ARC AGI 3 — you can't. The pattern, the way these things learn versus how we learn, we're way far. Example: I give you a new game called Francois Go — a new Go version where I changed all the rules. How quickly will you learn it? How quickly will the agent? You'd learn intra-game — you move, I take your piece, oh I get it. You have that epiphany. I call that batch-size-one learning. If you try to update weights with batch size one, gradients are massively unstable, you'll get NaNs quickly. Won't work. So increase the batch size — critical batch size is like 2048. You'd have to load up 2048 other experiences in your head simultaneously while learning. No chance — not even close. We learn stably and meaningfully with batch size one. That's very human. That's the key — something very different is going on. The counterexample: just throw it into context, in-context learn it. What no one knows: if you grab a random person at NeurIPS and ask, what happens if I give one experience into context, two, three, four, hundred? It doesn't monotonically improve. It improves, then falls, falls, falls — exceeds the context window. It does horribly. 
It wasn't trained to consistently improve as you increase in-context examples (ICL). From one, two, three examples, it wasn't trained to do that. There are recursive-self-improvement people trying to get it to learn that, but it wasn't learned at pretrain. Inherently, the model can't do this extremely well. Amazing that it works as well as it does, but for many real examples it doesn't. We do. How? The more I play chess, the better I do. Monotonically improving. Okay, "monotonic" maybe isn't fair. This is hitting my limits in this space, but I've been listening to the same podcast every flight, the Dwarkesh episode with Ilya. He talks about the value function. Is that similar to how we get to batch size one? It's a training procedure issue. There's a value function, and that's a target. That isn't an update rule. The actual update rule: if you're using Robbins–Monro, SGD is one implementation. Many different implementations. Your stochastic gradient descent is one way. For those who don't know, Robbins–Monro: is that hill-climbing generalized? It's a recursive update rule stepping on some estimate of the gradient times the learning rate to update the weight and hill-climb into a local optimum. Historically, it's the 1951 Robbins–Monro framework. We've been working off different versions for seventy years to find the right procedure. I work on zero-order stuff and perturbations off vanilla SGD to find an update rule. One thing is clear: we are certainly not doing backprop. It's very hard to see how I'm doing the transpose of the weight matrix in my head. Hard to do backprop-through-time. I'd need C versions of my brain, and where do the bits go? That information cannot exist in the brain. C copies of activations, doing backprop through them: we'd see it in MRI plots. It's not there. Since 1949, we've had Hebbian learning, Donald Hebb's rule. We know learning looks much more like that.
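As a rough illustration of the Robbins–Monro recursion mentioned above (step against a noisy gradient estimate, scaled by a learning rate), here is a minimal sketch. The toy objective, the function names, and the 1/t step schedule are assumptions chosen for the example; it minimizes a squared error, so it descends rather than climbs.

```python
import random


def robbins_monro(noisy_gradient, theta0, steps):
    """Iterate theta <- theta - a_t * g_t with decaying step size a_t = 1/t."""
    theta = theta0
    for t in range(1, steps + 1):
        g = noisy_gradient(theta)      # noisy estimate of the gradient at theta
        theta = theta - (1.0 / t) * g  # step against the estimated gradient
    return theta


rng = random.Random(0)
samples = []


def grad_of_squared_error(theta):
    # Gradient of 0.5 * (theta - x)^2 for one fresh sample x ~ N(3, 1).
    x = rng.gauss(3.0, 1.0)
    samples.append(x)
    return theta - x


theta_hat = robbins_monro(grad_of_squared_error, theta0=0.0, steps=2000)
sample_mean = sum(samples) / len(samples)
```

With this particular objective and the 1/t schedule, the recursion reduces to a running average of the samples, so `theta_hat` matches `sample_mean` up to floating-point error. Plain SGD is the same loop with a different step-size schedule and a minibatch gradient in place of the single-sample one.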
The bounded version — Hebb's is unbounded — is learning rate times x times y; you subtract a normalization term, which turns it into online PCA — Oja's rule. Oja's rule is stable and can update. The problem from my experimentation, why we can't replace gradient backprop with Oja's rule: it's a very weak learner. Generalization from one example is small. Online PCA with principal component one is weak. Sample efficiency isn't strong if you take exactly Robbins–Monro and apply it. But there are other learning procedures that look and smell like Oja's rule that will get us closer to closing the gap. That's what I work on. Let's talk about what you're working on for NeurIPS. Two research angles. One: batch-size-one learning. Instead of ICL, there are other cool ideas to take inspiration from. A paper from my lab called Cartridges — if you have a big code base that doesn't change much, or a bunch of experiences, and you want to compress all that in KV space, in the time space, into a much smaller compressed memory — a cartridge — you can with a self-study and reflection procedure. If I think about what I'm doing, there's that spirit of reflecting on what's been going on. Sabri calls this self-study. Similar to Google's recent paper on KV compression? In spirit, but unique. The Cartridges paper is unique — it starts from a random context, literally random numbers. I freeze the model, then learn my context in continuous space. Learning a context is strange, but I'm using SGD to learn a context and it works. Sonoma Decks has KV here. I'm learning what KV would maximize — over all possible KVs, what KV, if I made it this small, would perform well on this self-study objective freezing my model? That's what it learns. If you could do that online, that'd be cool. It's offline — maybe this is, I'm sleeping and compressing my KV cache into this little cartridge. My idea is in the same spirit but to get rid of that idea because it's expensive. You can learn a little adapter. 
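To make the Hebb versus Oja contrast above concrete, here is a small sketch on toy data. It assumes a single linear unit with output y = w · x, and the data distribution, learning rate, and starting weights are made up for illustration. Hebb's update is lr · y · x; Oja's rule subtracts the normalization term lr · y² · w, which keeps the weight norm near one and pulls w toward the first principal component.

```python
import random


def hebb_step(w, x, lr):
    """Plain Hebbian update: w <- w + lr * y * x, with y = w . x (unbounded)."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * y * xi for wi, xi in zip(w, x)]


def oja_step(w, x, lr):
    """Oja's rule: w <- w + lr * y * (x - y * w); the -y^2 * w term normalizes."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]


def norm(w):
    return sum(wi * wi for wi in w) ** 0.5


rng = random.Random(0)
# Toy 2-D data: most of the variance lies along the first axis.
data = [(rng.gauss(0.0, 2.0), rng.gauss(0.0, 0.5)) for _ in range(5000)]

w_hebb = [0.6, 0.8]
w_oja = [0.6, 0.8]
for x in data:
    # One example at a time: this is an online, batch-size-one update.
    w_hebb = hebb_step(w_hebb, x, lr=0.005)
    w_oja = oja_step(w_oja, x, lr=0.005)
```

After the loop, `norm(w_hebb)` has blown up while `w_oja` sits close to the unit vector along the first axis (up to sign), which is the online-PCA behavior described above. It also shows the limitation: a single unit only recovers one principal direction.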
There's something called recursive least squares where you map from where you're outputting now to your current output distribution. Freeze the model and learn a little w adapter that steers toward some new local thing. This is test-time learning. Similar to LoRA? LoRA touches the weights. Instead, I spawn a small weight matrix w. My output head outputs logits, I output a w, my actual output is original output plus alpha·w times that original output. I steer the distribution of my output logits a little toward some direction. As I get more training examples, I can quickly learn a w matrix with recursive least squares — literally order one every time I get one example. With order-one training I can get meaningfully better performance than just ICL. Saves on compute and other things. That's research direction one. The other is my Hail Mary I've been on for two years — everyone in deep learning says, why are you working on this, go work on backprop. There has to exist a learning procedure that is not the transpose of the weight matrix. Backprop is extremely expensive — I have to store activations. Things the brain isn't capable of. There's a better procedure. Some argue, if evolution could have discovered backprop it would have. Hell no. There are loss functions where backprop — the infinitesimally small local gradient — gives me direction from this infinitesimally small perturbation. Lots of loss functions, e.g. a zero/one, have no local information. The only information observable is by massive perturbation, and I have to get lucky — find a sample from zero, a sample from one. That is almost all of RL. Whether I won the game. Whether I got eaten by a lion. Whether I met my wife and reproduced. The loss function of life is extremely zero/one. Whether you reproduce and pass on your DNA before getting eaten by a lion — that is not a differentiable loss function. It just isn't. I take huge pause from that. 
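A tiny sketch of the zero/one point: on a step-shaped loss, an infinitesimal perturbation sees no slope at all, while large random perturbations can stumble onto a better region. The threshold loss, the function names, and the search settings are all assumptions for illustration; this is generic random search, not the learning procedure being researched.

```python
import random


def zero_one_loss(theta):
    """Loss is 0 only if theta clears a threshold; flat everywhere else."""
    return 0.0 if theta >= 3.0 else 1.0


def finite_difference(loss, theta, eps=1e-6):
    """Local gradient estimate from a very small symmetric perturbation."""
    return (loss(theta + eps) - loss(theta - eps)) / (2 * eps)


def perturbation_search(loss, theta, sigma, tries, rng):
    """Zero-order search: try large random jumps, keep any that lower the loss."""
    best = loss(theta)
    for _ in range(tries):
        candidate = theta + sigma * rng.gauss(0.0, 1.0)
        value = loss(candidate)
        if value < best:
            theta, best = candidate, value
    return theta, best


rng = random.Random(0)
local_grad = finite_difference(zero_one_loss, theta=0.0)
theta_found, loss_found = perturbation_search(
    zero_one_loss, theta=0.0, sigma=3.0, tries=500, rng=rng
)
```

Here `local_grad` is exactly zero, so a gradient step from `theta=0.0` goes nowhere, while the perturbation search only makes progress when a jump happens to land past the threshold. In one dimension that is cheap; the hard part, and the open question in the discussion, is doing something like this efficiently in very high dimensions.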
Other learning procedures will be better: they don't require activation retention, don't require an absolute gradient estimation, and work better in high-dimensional loss landscapes. I'm hunting for one that scales to multi-billion, hundred-billion, 800-billion parameters, the size of your brain. I think we're getting there. Devil's advocate: some say the path to better models or AGI/ASI will not stem from how humans do it. Your statements are anthropomorphic, as in "the brain doesn't do backprop." Do you believe your solution will stem from or be influenced by biology? The Yann LeCun-ism is to map this to flight. Did we need flapping wings? No. Did we need two wings? Yes, the Wright brothers needed two. Looking at everything that flies, trying to figure out flight while ignoring all of it and going off in another direction is probably a bad idea. Take inspiration. To be academic: the set of all things that can achieve intelligence is bigger than us. The learning procedures, how it learns, how it operates, are going to be very different, the way a rocket, helicopter, jet, turbine, turboprop, and drone all achieve flight differently. I wouldn't take a drone to Cartagena, or a helicopter, and probably wouldn't take a rocket to San Francisco. Certain things are better for certain situations. LLMs will be better in the limit. LLMs will be better on specific things, and they already are. But for fluid intelligence we're far superior. If you took an LLM and tried to play a brand new game, as ARC AGI 3 shows, humans destroy it. On the RAE test, the metric for ARC AGI 3, the best I've achieved is 0.02, meaning humans are 50 times more efficient in number of actions to get to a level. 50x. Most of the time it can't even solve it. Actually, it never solves it. So for fluid intelligence, glaring holes. A random Craigslist responder for a $15 ad beat the LLM. Another problem I've worked on: try to make an LLM funny. I realized it's by definition impossible.
People scoff — how can you know that? Take a Mitch Hedberg joke: my fake plant died because I forgot to pretend to water it. The token "pretend" has a one in 2,000 probability. With top-p sampling it will never be sampled. So it'll never be funny. The only way things are funny is high predictability, then complete shock, surprise, and recovery. That's every Mitch Hedberg joke — massively high probable, very low probable, very high probable. Nucleus sampling will never sample that trajectory, so it'll never be funny. Even with the right system prompts? You can't get there. How will you sample the funny one? I've tried beam search to get that exact shape — doesn't work. Were you the one who tweeted only two or three jokes an LLM can produce? No, but I did tweet about this whole phenomenon — a little viral thing. You need a new sampling method, minimum. We think at a higher level than token space — an idea space we can't train. What you want is contrastive — what's the minimum change in the sentence that meaningfully changes the semantic? These things can't do it, but we can. That's joke-writing. There's a deeper meaning — it tells us why LLMs can't write good movie scripts, good poems that are deep and profound and hit us hard, or maybe do great research. We were just talking about spinning up grad students, agents — you're really limited by compute. You've been mapping GPUs per student. That's becoming more important. If you want to do research or learn things, you need compute. Why go to a university that doesn't give you compute? Why work at a job that doesn't? Take Stanford — 4,000 CS students across undergrad, grad, PhD. They just launched Marlowe, 250 H100s. If you start at OpenAI you're given that as an intern. That's for the entire campus. Embarrassing. We have a $40B endowment — we can't give a billion to Nvidia to get a real cluster. I had this dinner with professors at NeurIPS. 
I experienced this firsthand — luckily I had an exit at Focal I can spend on GPUs. If I didn't, I couldn't even do any of my ideas. Many great minds in my lab can't get compute — that's the limiting step. I went down the rabbit hole asking everyone, where are you at on compute? Realized this isn't a Stanford problem, it's an academic problem — true in all academia. So at 11pm one night I grabbed six universities — Princeton, Harvard — took what I found online for H100s or equivalent, divided by number of students, posted it online. Then someone yelled at me. Yann LeCun yelled at me, said this is total BS — which I inferred meant total batch size. NYU wasn't there, so I added it, and they were in the bottom. They do have a thousand GPUs but six or seven thousand CS students across Tandon and Courant. How do we bring more awareness? If I went back to grad school, that's one of the first things I'd look for — do I have enough compute to run experiments? I can't afford $30,000 of Anthropic bills, and I probably want to play with open-source models and host them. In society we pay a lot in tax in America, spending money on a lot of stuff. What's going to cause the greatest good? AI is largely a cause for good and can cure a lot of disease. Read the Evo stuff — Evo has the capability of solving all disease. They can't get funding to go beyond an 80B model. If you went to an 800B model, maybe all disease would be solved. Worth it for society? What is Evo? Evo takes your LLM stack of predicting the next token and replaces it with DNA — predicting the next base pair. ACTG, predict the next. From that you get enriching embeddings — pretraining on DNA. Why useful? It's not, directly. But the embeddings produce a rich embedding space where you can train a small contrastive-learning thing — if I have ten people with a disease, I can detect that disease forever with 99% accuracy from DNA alone. Then you identify, we can solve this with one CRISPR edit. 
Many major diseases, especially rare ones with no research. Insane results. We're not limited by ideas, largely by compute. How do we solve that in academia? Endowments exist. There might be an issue mapping from which ideas to do — good auto-research with good ideas — to compute. But largely, we need more compute. Name a problem in the world — probability of solving it if we add more compute is like 99%. When you say compute, you mean GPUs — Nvidia. Not energy. I tried to train my models on CPUs but Nvidia and Jensen do a good job. Hard to compete. I won't get into the Dwarkesh pod and some of those things. I didn't wake up a loser. I say that every day in the mirror. Let's end on young founders — young potential entrepreneurs, what should they focus on, what AI has opened up to experiment and build the next generation of companies. What do those companies look like? The term "product" is ephemeral now because we can build so quickly and so many features for so many people. So what is product? "Services" is overloaded too — we're seeing the mesh of the two. If you were going to start a company, what would you focus on, what would you advise as a mentor? First thing: do a company. That's the larger step I wish more people would consider as an occupation. Successful entrepreneurs — I wouldn't call them in the top 10% of intelligence of people I've met, and they're wildly successful. The skills you need to be an entrepreneur — being Einstein is not one. Einstein would be a very bad entrepreneur. People think you have to come from the right background, go to Stanford. Hogwash. The best entrepreneurs don't have traditional backgrounds. My biggest advice: go do it. Have the cojones, don't take no for an answer. Once you take that plunge, take advice. No shortage of advice on the web. If you want to get to the moon, the only real people to talk to are people who've been to the moon — listen to astronauts. 
Everyone else — your mom, dad, parent — unless they've been to the moon, weight their advice to zero. Literally, not 0.1 — zero. Hard for people to do. My mom, once I had my job at Lockheed Martin, told me explicitly, oh, you have this pension now, in forty years you'll have an amazing pension, a job for life. Sounds awful. Can you just kill me now? Allocate your attention carefully — how much you listen to whom. Find entrepreneurs you admire and listen to every single thing they say. Last piece: most of the time you learn by doing. Why are Steve, Bill Gates, and Jensen so good? They've failed so many times. There's no shortcut. The faster you get up there, launch, fail — do that a hundred times — that's the only way you get good. My advice: throw a lot of crap out there. Don't be embarrassed. People with less social awareness make better entrepreneurs because they don't care about throwing it out and looking stupid. Who cares? Just do it. Throw it out there. Not everything has to be beautiful, amazing — keep learning. That's the best way to become a great entrepreneur. Perfect way to end. Francois, thank you for the time.
Francois Chaubard co-created CS224D at Stanford with Richard Socher: the course that replaced the entire NLP curriculum two years after launch and still carries his name. He then spent nine years as a founder before closing a $60M enterprise deal at Focal. He is now a Visiting Partner at Y Combinator and a PhD researcher working on alternatives to backpropagation. He is not a generalist with opinions.
In this conversation with Chad, Francois opens with a provocation: in terms of fluid intelligence, humans still dominate. The ARC-AGI 3 benchmark makes this concrete. The best LLMs are 50x less efficient than humans on novel tasks. Nucleus sampling structurally prevents LLMs from ever being funny, not as a solvable limitation but as a mathematical consequence of how the word "pretend" sits at a 1-in-2,000 token probability. Stanford's entire compute cluster for 4,000 CS students is 250 H100s, what a single OpenAI new hire gets on day one. Each of these points has an implication, and Francois spells them out.
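The nucleus-sampling argument can be illustrated with a toy next-token distribution. This is a sketch under stated assumptions: only the 1-in-2,000 probability for "pretend" comes from the conversation. The other tokens, their probabilities, the `top_p` value of 0.9, and the helper name `nucleus_filter` are made up for illustration.

```python
import numpy as np


def nucleus_filter(probs, top_p):
    """Return renormalized probabilities restricted to the top-p nucleus."""
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]            # most likely token first
    cumulative = np.cumsum(probs[order])
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()


# Toy candidates for the slot in "...because I forgot to ___ to water it".
tokens = ["remember", "start", "bother", "try", "pretend"]
probs = [0.60, 0.25, 0.10, 0.0495, 0.0005]     # "pretend" is 1 in 2,000

filtered = nucleus_filter(probs, top_p=0.9)    # top_p is an assumed setting
pretend_index = tokens.index("pretend")

rng = np.random.default_rng(0)
draws = rng.choice(len(tokens), size=10_000, p=filtered)
pretend_count = int((draws == pretend_index).sum())

print(f"P(pretend) after filtering: {filtered[pretend_index]}")
print(f"times 'pretend' was sampled in 10,000 draws: {pretend_count}")
```

Once a token falls outside the nucleus its probability is set to exactly zero, so no number of samples will produce it. Whether a real model excludes a given token depends on its actual distribution and the chosen `top_p`; the sketch only shows the mechanism.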
Topics discussed:
Y Combinator is the world's most influential startup accelerator. Where most accelerators focus on capital, YC's core product is a three-month program that compresses years of founder learning into a single batch: direct access to partners who are exclusively former founders and YC alumni, a global network of operators across every stage and sector, and a forcing function that pushes teams to talk to customers and ship before they think they're ready. Two-thirds of the current portfolio is international, and the program funds at the pre-product stage.
You know, I love YC. I love what YC stands for. We give opportunities to folks that, like, no one else would have funded. No one would have funded Airbnb. If PG didn't fund Brian Chesky, they literally said, we're just gonna go back to, like, Rhode Island and start designing some PowerPoints or something. And so there's so many companies and examples like that. So it's just super fun to be helping, you know, these companies, these great founders, and generally social mobility.
My statement is the set of all things that can achieve intelligence is certainly bigger than us. And the learning procedures, the way it looks and smells, the way it learns, the way it operates is going to be very different in the way that a rocket and a helicopter and a jet and a turbine, like the way that they achieve flight are all very different. And a drone. Right? Those are all different ways of achieving flight. But I wouldn't take a drone to Cartagena, and I wouldn't take a helicopter to Cartagena, and I probably wouldn't take a rocket to San Francisco. So like there's certain things that will be better for certain situations. And so LLMs are going to be better in the limit. LLMs will be better than us on specific things, and they already are. But in terms of fluid intelligence, we still are far superior.
If you were going to start a company, what would you focus on? And as a mentor, what would you advise? I would say the first thing is do a company. Yeah. Like, I think that's the larger step that I wish more people would consider as an occupation. The people that are successful entrepreneurs, I wouldn't call them in the top ten percent of the intelligence of people that I've met, and they're wildly successful. The skills that you need to be an entrepreneur, one of them is not to be Einstein. Like, I think Einstein would actually be a very bad entrepreneur. People think that you have to come from the right background or something like that or you had to go to Stanford or whatever. It's all hogwash. Yeah. And the best entrepreneurs don't have, like, traditional backgrounds at all. I'd say that's my biggest advice is, like, go do it and don't take no for an answer.
Most of the time, you learn by doing. And so Steve and Bill Gates and Jensen, why are they so good? It's because they've failed so many times, and there is no shortcut. There just isn't. And so, like, the faster you get up there and launch something and fail, do that a hundred times. You know, that's the only way you're going to get good.

