Google Image search pagination returns useless page
complete
Illia Zub
complete
We have deployed the fix today. Results 21-100 from the first page are now being extracted as well.
@Alex Smith, sorry for the huge delay.
Alex Smith
Illia Zub awesome. Glad I could clarify. Thanks for all your hard work making this work!!
Alex Smith
Illia Zub I see what you're saying about it returning the HTML. What it's doing now, as mentioned here previously, is returning 20 results for the first response, then letting me use the "ijn" parameter to paginate through the results. The issue now is that you can't get results 21-100, because when you specify an ijn of 1, it starts at result 101. Does that make sense?
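The gap described here can be illustrated with a small sketch. It assumes, as reported above, that the first response holds 20 results and that each `ijn` step advances by 100; the function names are hypothetical and are not part of any API.

```python
def reachable_ranges(first_page_count, page_size, pages):
    """Return the 1-based (first, last) result ranges reachable via ijn = 0..pages-1."""
    ranges = []
    for ijn in range(pages):
        first = ijn * page_size + 1
        # Page 0 may return fewer results than a full page (20 in the report above).
        count = first_page_count if ijn == 0 else page_size
        ranges.append((first, first + count - 1))
    return ranges


def missing_ranges(ranges):
    """Return the (first, last) gaps between consecutive reachable ranges."""
    gaps = []
    for (_, prev_last), (next_first, _) in zip(ranges, ranges[1:]):
        if next_first > prev_last + 1:
            gaps.append((prev_last + 1, next_first - 1))
    return gaps
```

With `first_page_count=20` and `page_size=100`, the only gap is results 21-100, which is the range that cannot be requested.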
Illia Zub
Alex Smith: Thanks for your clarification. Yes, now I see the problem: we don't extract data from the HTML when the page contains the "The rest of the results might not be what you're looking for. See more anyway" message. The missing 80 results are included in the page, so we'll fix our parser to extract these results.
Thanks again for the explanation 👍
Illia Zub
bitsofinfo: It seems that Google changed something and pagination via ijn works again. When there are no results for the specified page number (the ijn parameter), Google returns HTML like in the original post. Usually there are up to ten pages per specific query.
Are you still experiencing that error?
bitsofinfo
Experiencing the same, please fix!
Elizabeth Oster
bitsofinfo: Hi, sorry about the issue.
We've been investigating this issue and found that Google has changed a lot about how they do pagination which is why it's no longer working with our API.
We're still working out the best route to take to restore pagination as best we can. But for now at least, it looks like the ijn parameter won't be the way to paginate on Google Images anymore.
Julien Khaleghy
More details:
We've looked at it with Milos yesterday. Images page changed a lot.
Pagination is now performed by a POST request to a URL looking like
https://www.google.com/_/VisualFrontendUi/data/batchexecute?rpcids=HoAMBc&f.sid=-4610710633484194050&bl=boq_visualfrontendserver_20201117.11_p0&hl=uk&authuser&soc-app=162&soc-platform=1&soc-device=1&_reqid=758918&rt=c
with parameters looking like f.req=[[["HoAMBc","[null,null,[3,null,697,1,1680,[[\"ok7KWBqfT988VM\",299,168,33652073]],[],[],null,null,null,738,300,[]],null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,null,[\"cat\"],null,null,null,null,null,null,null,null,[\"CgUIJRCsAgoECBoQAw==\",\"CAM=\",\"CgtHUklEX1NUQVRFMBAaIAA=\"]]",null,"generic"]]]
and a huge response. Images could be retrieved from that response.
Things to be done with the new pagination design:
- find out what query params affect the result
- find out what body params affect the result
- parse the response and extract data
- modify utils to perform post request
- find out if we are able to retrieve page number n without getting the previous/first page (to get some values to be used as parameters)
Things to consider:
- will take decent time to investigate and implement
- making same request twice doesn't seem to return the same data
- no html might be provided for debug
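For the "modify utils to perform post request" item, here is a minimal sketch of how a body shaped like the f.req example above could be assembled. It only shows the double JSON encoding visible in that example (an inner JSON string wrapped in an outer JSON array). The function names and the short inner payload are hypothetical placeholders, and the meaning of the individual fields is still unknown.

```python
import json
from urllib.parse import urlencode, parse_qs


def build_f_req(rpc_id, inner_payload):
    """Wrap inner_payload the way the f.req example appears to be wrapped."""
    # The inner payload is serialized to a JSON string first ...
    inner = json.dumps(inner_payload, separators=(",", ":"))
    # ... and that string is embedded in the outer envelope, which is serialized again.
    envelope = [[[rpc_id, inner, None, "generic"]]]
    return json.dumps(envelope, separators=(",", ":"))


def build_post_body(rpc_id, inner_payload):
    """Return a form-encoded POST body with a single f.req field."""
    return urlencode({"f.req": build_f_req(rpc_id, inner_payload)})


# Placeholder inner payload for illustration only; the real one is far longer.
example_body = build_post_body("HoAMBc", [None, None, ["cat"]])
```

Decoding `example_body` with `parse_qs` and calling `json.loads` twice returns the original inner list, which is a quick way to check the encoding round-trips.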
### Alternatively, we found a way to retrieve a page without the new pagination. It arguably might not be the best solution in the long term, but it looks like the best solution in the short term.
To retrieve such a page, the gbv=1 parameter should be added to the URL.
Things to be done with such a page design:
- add the gbv param
- remove ijn and use start instead, OR treat ijn as start when we inject the URL string (this kind of design uses the start parameter for offset)
- rewrite parser
Things to consider:
- gives 20 results per page versus 100 results per page in the updated design, and most likely there is no num available
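A rough sketch of the "treat ijn as start" option follows. It assumes that each ijn page maps to an offset of 20 results, that the page is served from the usual /search path, and that tbm=isch selects image results. None of these details are confirmed in this thread, and the function name is hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

RESULTS_PER_PAGE = 20  # per the thread: gbv=1 pages give 20 results per page


def build_basic_images_url(query, ijn=0):
    """Build a gbv=1 image search URL, translating ijn into a start offset."""
    params = {
        "q": query,
        "tbm": "isch",  # assumption: selects image results
        "gbv": "1",     # page version rendered without JavaScript
        "start": ijn * RESULTS_PER_PAGE,  # assumption: offset = page index * 20
    }
    return "https://www.google.com/search?" + urlencode(params)
```

For example, `build_basic_images_url("cat", ijn=2)` produces a URL with `start=40`.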
I might have missed some info; I will update if anything comes to mind. Also, I guess Milos will add some details that I've missed.
I will also post more findings on the new design if we decide to go with it.
Here is the screenshot of the page when using gbv=1 (page with disabled JavaScript)
As Igor pointed out, this would be the easiest and quickest solution. But the obvious drawback is getting 20 results per page instead of 100.
Elizabeth Oster
planned
Elizabeth Oster
under review