At the recent Google Search Central Bangkok event, Googler Gary Illyes was interviewed by Kenichi Suzuki and Rio Ichikawa.
Here is everything that was asked and what Gary said in reply!
If I block Google-Extended, my content won’t be used to train the Gemini app and the Vertex AI API. However, this doesn’t affect AI Overviews or AI Mode, even though AI Overviews and AI Mode also use a customized Gemini as their LLM. Why is this the case? Does Google separate content used to train the Gemini app from content used by AI Overviews and AI Mode, even if it comes via Google-Extended?
“Right. So as you noted, the model that we use for AIO, for AI Overviews, and for AI Mode is a custom Gemini model, and that might mean that it was trained differently. I don’t know the exact details of how it was trained, but it’s definitely a custom model.”
Does that mean Gemini, AI Overviews, and AI Mode use separate indexes for grounding?
“As far as I know, Gemini, AI Overviews, and AI Mode all use Google Search for grounding. So basically they issue multiple queries to Google Search, and then Google Search returns results for those particular queries.”
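Conceptually, what Gary describes is retrieval-augmented generation: the system issues ordinary search queries and only the answer-writing step involves an LLM. Here is a minimal runnable sketch of that flow in Python; every function name here is a hypothetical stand-in, not a real Google API:

```python
# Conceptual sketch of search grounding. All names are hypothetical stand-ins.

def search_index(query: str) -> list[str]:
    """Stand-in for a query against the regular search index; no LLM involved."""
    return [f"top result for {query!r}"]

def llm_generate(prompt: str, context: list[str]) -> str:
    """Stand-in for the generation step, the only part an LLM performs."""
    return f"Answer to {prompt!r}, grounded in {len(context)} search results."

def answer_with_grounding(question: str) -> str:
    # 1. Derive one or more search queries from the user's question.
    queries = [question]  # real systems fan out into several rewritten queries
    # 2. Retrieve results for each query from the ordinary search index.
    context = [hit for q in queries for hit in search_index(q)]
    # 3. Generate the final answer from the retrieved results.
    return llm_generate(question, context)

print(answer_with_grounding("what is crawl budget?"))
```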
Does that mean the training data used by AI Overviews and AI Mode is collected by the regular Googlebot, not Google-Extended?
“But you have to remember that when grounding happens, there’s no AI involved. So basically it’s the generation that is affected by Google-Extended. But also, if you disallow Google-Extended, then Gemini is not going to ground for your site, basically.”
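For reference, Google-Extended is controlled in robots.txt like any other user agent token. A minimal example that opts a site out of Gemini training and grounding while leaving regular Search crawling untouched:

```
# Block Google-Extended (Gemini app / Vertex AI training, and Gemini grounding)
User-agent: Google-Extended
Disallow: /

# Regular Search crawling is unaffected
User-agent: Googlebot
Allow: /
```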
(Referencing an earlier explanation) AI Mode does not get page content directly from the live web, while Gemini does. Is this connected?
“I mean, I don’t know what Gemini does because I never worked on Gemini, but AI Mode is definitely just retrieving whatever we have in our index. So, as far as I know, it’s currently not doing live fetches.”
As more content is created by AI, LLMs learn from that content. What are your thoughts on this trend, and what are its potential drawbacks? For example, Google might have to crawl too many pages, or the quality of its search index could suffer.
“I’m not worried about the search index, but model training definitely needs to figure out how to exclude content that was generated by AI; otherwise you end up in a training loop, which is really not great for training. I’m not sure how much of a problem this is right now, mainly because of how we select the documents that we train on.”
You said you don’t care how the content is created, whether by humans or by AI. So, as long as the quality is high, you’ll use AI-generated content for training. So AI is trained on content created by AI, and then AI trains on AI, on and on.
“Sure. But if you can maintain the quality of the content and the accuracy of the content, and ensure that it’s of high quality, then technically it doesn’t really matter.
The problem starts to arise when the content is either extremely similar to something that was already created, which hopefully we are not going to have in our index to train on anyway. And then the second problem is when you are training on inaccurate data, and that is probably the riskier one, because then you start introducing biases and counterfactual data into your models. As long as the content quality is high, which typically nowadays requires that a human reviews the generated content, it is fine for model training.”
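To make Gary’s two failure modes concrete, a training pipeline would have to filter for both near-duplicates and accuracy before documents are used. A minimal sketch of that idea in Python; the scoring functions are simplistic hypothetical stand-ins, not anything Google has described:

```python
# Hypothetical document filter illustrating the two problems described above:
# near-duplicate content and inaccurate content. Both scorers are stand-ins.

def near_duplicate(doc: str, seen: set[str]) -> bool:
    """Crude stand-in for real near-duplicate detection (e.g. shingling/MinHash)."""
    return doc.strip().lower() in seen

def accuracy_score(doc: str) -> float:
    """Stand-in for an accuracy/quality classifier; real systems are far richer."""
    return 0.0 if "the moon is made of cheese" in doc.lower() else 1.0

def select_for_training(docs: list[str], threshold: float = 0.8) -> list[str]:
    seen: set[str] = set()
    selected = []
    for doc in docs:
        if near_duplicate(doc, seen):        # problem 1: training loop on copies
            continue
        if accuracy_score(doc) < threshold:  # problem 2: counterfactual data
            continue
        seen.add(doc.strip().lower())
        selected.append(doc)
    return selected

print(select_for_training(["Water boils at 100 C.", "Water boils at 100 C.",
                           "The moon is made of cheese."]))
```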
You said in a session that you use content created by humans at the moment, but that might change in the future if AI can create high-quality content.
“I don’t think that we are going to change our guidelines anytime soon about whether you need to review it or not. So when we say that it’s human, I think the term ‘human-created’ is wrong. It should be ‘human-curated’: basically, someone had editorial oversight over the content and validated that it’s actually correct and accurate.”
Cloudflare recently launched “Pay Per Crawl,” which uses the HTTP 402 Payment Required status code (still under development). Would you expect this new technology to benefit both publishers and AI companies, including Google?
“Honestly, I don’t have thoughts on this yet because I was at another event when this was announced. I haven’t had time to digest it yet.”
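For background, HTTP 402 is a long-reserved status code that lets a server refuse a request until payment is arranged. A minimal sketch of how a crawler might encounter it, using Python’s requests library; the URL and bot name are placeholders, and the payment negotiation itself is out of scope:

```python
import requests

# Hypothetical crawler fetch; the URL and user agent are placeholders.
resp = requests.get(
    "https://example.com/premium-article",
    headers={"User-Agent": "ExampleAIBot/1.0"},
)

if resp.status_code == 402:
    # 402 Payment Required: the publisher is asking the crawler to pay
    # (e.g. via a scheme like Cloudflare's) before the content is served.
    print("Content is paywalled for crawlers; payment negotiation needed.")
else:
    resp.raise_for_status()
    print(resp.text[:200])
```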
Now, many big publishers have decided to block Google-Extended. Would it negatively impact Google’s training in general?
“In general, I have no idea, because the models are generally trained by DeepMind and I never worked at DeepMind. From a Google Search perspective, I don’t think it causes any problem.”
Many publishers don’t want their content to be used for model training. Personally, I don’t care about it at the moment; there is no perfect solution for it yet. Could we have more granular control in the near future, say, in a few years?
“So there are definitely ideas floating around. I’m involved in an IETF working group called AI Preferences, where we are talking about developing a standard that would allow publishers to granularly control what their content can be used for. I don’t know where it’s going to go, but it is something that we are working on with the IETF.
And it does have momentum. When, or if, it will be launched, that’s to be seen. There are lots of misconceptions still about AI, even in the technical space; for example, there are lots of questions related to inference. There are also concerns that people are blocking things that they don’t fully understand, and that might somehow decrease innovation in the space, which I’m not saying is good or bad, but it might happen. So yeah, there are lots of considerations to be taken into account, and we don’t yet know where we are going to go with it, but we are definitely working on it.”
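To give a feel for what granular control could look like, here is a purely hypothetical declaration in a robots.txt-like syntax. None of these directives exist today, and the IETF AI Preferences work has not settled on a format; this illustrates the idea, not the draft:

```
# Hypothetical syntax only; not part of any published standard.
User-agent: *
Allow-usage: search        # indexing for classic search results
Disallow-usage: ai-train   # training generative models
Allow-usage: ai-ground     # retrieval/grounding at inference time
```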
You mentioned in a session that 404 pages don’t use up crawl budget. However, what if a site has a huge number of 404 pages? We know Google tries to crawl 404 pages from time to time to confirm they are still unavailable. Could you clarify?
“404 pages don’t consume crawl budget. But I think the question is kind of rational, because it’s very easy to point a lot of broken links at a competitor, and does that, or should that, mean that you manage to eat up your competitor’s crawl budget? It should not mean that. So basically, the best way to avoid that is to not count 404 pages toward crawl budget.
I mean, it does take up some scheduling, but scheduling is generally not going to be a problem, because once we discover that one URL pattern, for example, is generally 404, then we will start crawling less and less from there. So I don’t see how it would be a problem. One thing that I know is causing problems for publishers, because we get crawl issue reports about this, is when your 404 pages perform expensive operations on the website.
So for example, it issues multiple SQL queries and then waits for the data to come back, and basically it’s eating up the resources of the server for useless pages. My recommendation for those cases is that you serve the simplest 404 page that you can afford to have, and go with that instead of computationally expensive pages, because 404s do happen, and every now and then we will discover a large swath of 404s on your site and then eat up your server resources accidentally.”
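As an illustration of that advice, a 404 handler should return a small static response instead of touching the database or rendering heavy templates. A minimal sketch using Flask (the framework choice is ours, not Gary’s):

```python
from flask import Flask

app = Flask(__name__)

@app.errorhandler(404)
def not_found(_error):
    # Serve a tiny static page: no SQL queries, no template rendering,
    # so a burst of 404 crawls can't eat up server resources.
    return "<h1>404 - Page Not Found</h1>", 404

if __name__ == "__main__":
    app.run()
```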
If there’s content where the content itself is legit, the sentences are legit, and there are a lot of images relevant to the content, but all of the images are generated by AI, will that content or the overall site be penalized or not?
“Nope. No. AI-generated images don’t impact SEO, not directly. Obviously, when you put images on your site, you will have to sacrifice some resources to those images, but otherwise I don’t think that you’re going to see any negative impact from them. If anything, you might get some traffic out of image search or video search for them. But otherwise it should just be fine.”
Is the number of views/shares on social media used as a ranking signal for SEO, or in general?
“For this we have a very old, very canned response, based on something that we learned over the years, particularly one incident around 2014. The answer is no, and for the future it’s also likely no. That’s because we need to be able to control our own signals, and if we are looking at external signals, for example a social network’s signals, that’s not in our control. So if someone on that social network decides to inflate the numbers, we don’t know if that inflation was legit or not, and we have no way of knowing that.”
AI Mode is not available in Japan, but a lot of people are showing interest in it. Is there a chance that advertisements will be shown in AI Mode?
“I have no idea. I never worked on ads. I’m assuming yes, because I also didn’t know what was going to happen with Discover, for example, and then just yesterday I saw an ad in Discover. I didn’t even know that we had ads in Discover. So I’m assuming yes, but it’s not my field, really, and it’s not my decision for certain.”
Do you think it’s highly likely (that ads will be shown in AI Mode)?
“I don’t know how to answer that because I never worked on ads or with ads. So I don’t know if it’s even one of those services where ads make sense.”
If AI Mode gets much bigger and people mainly use AI Mode for information, Google will lose revenue from classic search advertising. How is Google going to make revenue?
“I don’t know, but it’s not my problem. We have people to figure these things out, so it’s really just not my problem how Google makes its money. My job is to answer these kinds of questions, well, not the ads questions, but the previous questions, to the best of my ability, and basically that’s it.”