1- # PaGoDo - Passive Google Dork
1+ # pagodo - Passive Google Dork
22
33## Introduction
44
5- pagodo automates Google searching for potentially vulnerable web pages and applications on the Internet. It replaces
5+ ` pagodo ` automates Google searching for potentially vulnerable web pages and applications on the Internet. It replaces
66manually performing Google dork searches with a web GUI browser.
77
88` pagodo ` has 2 parts. The first is ` ghdb_scraper.py ` , which retrieves the latest Google dorks, and the second is
99` pagodo.py ` , which leverages the information gathered by ` ghdb_scraper.py ` .
1010
11- HakByte created a video tutorial on using pagodo. It starts around 8 minutes in and you can find it here
12- < https://www.youtube.com/watch?v=lESeJ3EViCo&t=481s >
11+ The core Google search library now uses the more flexible [ yagooglesearch] ( https://github.com/opsdisk/yagooglesearch )
12+ instead of [ googlesearch] ( https://github.com/MarioVilas/googlesearch ) . Check out the
13+ [ yagooglesearch README] ( https://github.com/opsdisk/yagooglesearch/blob/master/README.md ) for a more in-depth explanation
14+ of the library differences and capabilities.
15+
16+ This version of ` pagodo ` also natively supports HTTP(S) and SOCKS5 proxies, so there is no need to wrap it in a tool
17+ like ` proxychains4 ` if you need proxy support. You can specify multiple proxies to use in a round-robin fashion by
18+ providing a comma-separated string of proxies using the ` -p ` switch.
1319
1420## What are Google dorks?
1521
1622Offensive Security maintains the Google Hacking Database (GHDB) found here:
1723< https://www.exploit-db.com/google-hacking-database > . It is a collection of Google searches, called dorks, that can be
18- used to find potentially vulnerable boxes or other juicy info that is picked up by Google's search bots.
24+ used to find potentially vulnerable boxes or other juicy info that is picked up by Google's search bots.
25+
26+ ## Terms and Conditions
27+
28+ The terms and conditions for ` pagodo ` are the same terms and conditions found in
29+ [ yagooglesearch] ( https://github.com/opsdisk/yagooglesearch#terms-and-conditions ) .
30+
31+ This code is supplied as-is and you are fully responsible for how it is used. Scraping Google Search results may
32+ violate their [ Terms of Service] ( https://policies.google.com/terms ) . Another Python Google search library had some
33+ interesting information/discussion on it:
34+
35+ * [ Original issue] ( https://github.com/aviaryan/python-gsearch/issues/1 )
36+ * [ A response] ( https://github.com/aviaryan/python-gsearch/issues/1#issuecomment-365581431 )
37+ * Author created a separate [ Terms and Conditions] ( https://github.com/aviaryan/python-gsearch/blob/master/T_AND_C.md )
38+ * ...that contained a link to this [ blog] ( https://benbernardblog.com/web-scraping-and-crawling-are-perfectly-legal-right/ )
39+
40+ Google's preferred method is to use their [ API] ( https://developers.google.com/custom-search/v1/overview ) .
1941
2042## Installation
2143
@@ -24,7 +46,7 @@ Scripts are written for Python 3.6+. Clone the git repository and install the r
2446``` bash
2547git clone https://github.com/opsdisk/pagodo.git
2648cd pagodo
27- virtualenv -p python3 .venv # If using a virtual environment.
49+ virtualenv -p python3.7 .venv # If using a virtual environment.
2850source .venv/bin/activate # If using a virtual environment.
2951pip install -r requirements.txt
3052```
@@ -35,12 +57,13 @@ To start off, `pagodo.py` needs a list of all the current Google dorks. The rep
3557the current dorks when the ` ghdb_scraper.py ` was last run. It's advised to run ` ghdb_scraper.py ` to get the freshest
3658data before running ` pagodo.py ` . The ` dorks/ ` directory contains:
3759
38- * the ` all_google_dorks.txt ` file which contains all the Google dorks
39- * Individual dork category dorks
60+ * the ` all_google_dorks.txt ` file which contains all the Google dorks, one per line
61+ * the ` all_google_dorks.json ` file which is the JSON response from GHDB
62+ * Individual category dorks
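
Because ` all_google_dorks.txt ` holds one dork per line, it can be loaded with a few lines of Python. The sketch below is
illustrative only: `load_dorks` is a hypothetical helper, not part of `pagodo`, and it assumes blank lines should be
skipped.

```python
from pathlib import Path
from typing import List


def load_dorks(path: str) -> List[str]:
    """Read a dork file with one dork per line, skipping blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Example (assumes the file exists in the cloned repository):
# dorks = load_dorks("dorks/all_google_dorks.txt")
```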
4063
4164Dork categories:
4265
43- ``` none
66+ ``` python
4467categories = {
4568 1 : " Footholds" ,
4669 2 : " File Containing Usernames" ,
@@ -59,27 +82,18 @@ categories = {
5982}
6083```
6184
62- Fortunately, the entire database can be pulled back with 1 HTTP GET request using ` ghdb_scraper.py ` . You can dump all
63- dorks to a file, the individual dork categories to separate dork files, or the entire json blob if you want more
64- contextual data about each dork.
65-
6685### Using ghdb_scraper.py as a script
6786
68- To retrieve all dorks:
69-
70- ``` bash
71- python ghdb_scraper.py -j -s
72- ```
73-
74- To retrieve all dorks and write them to individual categories:
87+ Write all dorks to ` all_google_dorks.txt ` and ` all_google_dorks.json ` , and write each category to its own file. The
88+ JSON file is useful if you want more contextual data about each dork.
7589
7690``` bash
77- python ghdb_scraper.py -i
91+ python ghdb_scraper.py -s -j -i
7892```
7993
8094### Using ghdb_scraper as a module
8195
82- The ` ghdb_scraper.retrieve_google_dorks() ` returns a dictionary with the following data structure:
96+ The ` ghdb_scraper.retrieve_google_dorks() ` function returns a dictionary with the following data structure:
8397
8498``` python
8599ghdb_dict = {
@@ -105,75 +119,132 @@ dorks["category_dict"].keys()
105119dorks[" category_dict" ][1 ][" category_name" ]
106120```
107121
108- ## pagodo.py
122+ ## <span>pagodo.py</span>
109123
110- Now that a file with the most recent Google dorks exists, it can be fed into ` pagodo.py ` using the ` -g ` switch to start
111- collecting potentially vulnerable public applications. ` pagodo.py ` leverages the ` google ` python library to search
112- Google for sites with the Google dork, such as:
124+ ### Using <span>pagodo.py</span> as a script
113125
114- ``` none
115- intitle:"ListMail Login" admin -demo
126+ ``` bash
127+ python pagodo.py -d example.com -g dorks.txt
116128```
117129
118- The ` -d ` switch can be used to specify a domain and functions as the Google search operator:
130+ ### Using pagodo as a module
119131
120- ``` none
121- site:example.com
122- ```
132+ The ` pagodo.Pagodo.go() ` method returns a dictionary with the data structure below (the dorks used are made-up examples):
123133
124- Performing ~ 4600 search requests to Google as fast as possible will simply not work. Google will rightfully detect it
125- as a bot and block your IP for a set period of time. In order to make the search queries appear more human, a couple of
126- enhancements have been made. A pull request was made and accepted by the maintainer of the Python ` google ` module to
127- allow for User-Agent randomization in the Google search queries. This feature is available in
128- [ 1.9.3] ( https://pypi.python.org/pypi/google ) and allows you to randomize the different user agents used for each search.
129- This emulates the different browsers used in a large corporate environment.
134+ ``` python
135+ {
136+     "dorks": {
137+         "inurl:admin": {
138+             "urls_size": 3,
139+             "urls": [
140+                 "https://github.com/marmelab/ng-admin",
141+                 "https://github.com/settings/admin",
142+                 "https://github.com/akveo/ngx-admin",
143+             ],
144+         },
145+         "inurl:gist": {
146+             "urls_size": 3,
147+             "urls": [
148+                 "https://gist.github.com/",
149+                 "https://gist.github.com/index",
150+                 "https://github.com/defunkt/gist",
151+             ],
152+         },
153+     },
154+     "initiation_timestamp": "2021-08-27T11:35:30.638705",
155+     "completion_timestamp": "2021-08-27T11:36:42.349035",
156+ }
157+ ```
130158
131- The second enhancement focuses on randomizing the time between search queries. A minimum delay is specified using the
132- ` -e ` option and a jitter factor is used to add time on to the minimum delay number. A list of 50 jitter times is created
133- and one is randomly appended to the minimum delay time for each Google dork search.
159+ Using a Python shell (like ` python ` or ` ipython ` ) to explore the data:
134160
135161``` python
136- # Create an array of jitter values to add to delay, favoring longer search times.
137- self .jitter = numpy.random.uniform(low = self .delay, high = jitter * self .delay, size = (50 ,))
162+ import pagodo
163+
164+ pg = pagodo.Pagodo(
165+     google_dorks_file="dorks.txt",
166+     domain="github.com",
167+     max_search_result_urls_to_return_per_dork=3,
168+     save_pagodo_results_to_json_file=True,
169+     save_urls_to_file=True,
170+     verbosity=5,
171+ )
172+ pagodo_results_dict = pg.go()
173+
174+ pagodo_results_dict.keys()
175+
176+ pagodo_results_dict["initiation_timestamp"]
177+
178+ pagodo_results_dict["completion_timestamp"]
179+
180+ for key, value in pagodo_results_dict["dorks"].items():
181+     print(f"dork: {key}")
182+     for url in value["urls"]:
183+         print(url)
138184```
139185
140- Latter in the script, a random time is selected from the jitter array and added to the delay.
186+ ## Tuning Results
141187
142- ``` python
143- pause_time = self .delay + random.choice(self .jitter)
188+ ### Scope to a specific domain
189+
190+ The ` -d ` switch can be used to scope the results to a specific domain and functions as the Google search operator:
191+
192+ ``` none
193+ site:github.com
144194```
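
As a rough illustration of how a domain scope combines with a dork, the sketch below prepends the `site:` operator to
the search string. `build_query` is a hypothetical helper written for this example; the exact way `pagodo.py`
assembles its query is an assumption here.

```python
from typing import Optional


def build_query(dork: str, domain: Optional[str] = None) -> str:
    """Prefix the dork with Google's site: operator when a domain is given."""
    if domain:
        return f"site:{domain} {dork}"
    return dork


print(build_query("inurl:admin", "github.com"))  # site:github.com inurl:admin
```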
145195
146- Experiment with the values, but the defaults successfully worked without Google blocking my IP. Note that it could take
147- a few days (3 on average) to run so be sure you have the time.
196+ ### Wait time between Google dork searches
197+
198+ * ` -i ` - Specify the ** minimum** delay between dork searches, in seconds. Don't make this too small, or your IP will
199+ get HTTP 429'd quickly.
200+ * ` -x ` - Specify the ** maximum** delay between dork searches, in seconds. Don't make this too big or the searches will
201+ take a long time.
202+
203+ The values provided by ` -i ` and ` -x ` are used to generate a list of 20 random wait times, one of which is randomly
204+ selected between each Google dork search.
205+
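The wait-time behavior described above can be sketched with the standard library. This is a minimal illustration, not
the code `pagodo.py` actually runs: `build_wait_times` and the example bounds of 37 and 60 seconds are assumptions made
for the example.

```python
import random
from typing import List


def build_wait_times(minimum_delay: float, maximum_delay: float, count: int = 20) -> List[float]:
    """Return `count` random delays, in seconds, between the two bounds."""
    if minimum_delay > maximum_delay:
        raise ValueError("minimum_delay must not exceed maximum_delay")
    return [random.uniform(minimum_delay, maximum_delay) for _ in range(count)]


# Build the list once, then pick one delay before each dork search.
wait_times = build_wait_times(37, 60)
pause_seconds = random.choice(wait_times)
print(f"Sleeping {pause_seconds:.1f} seconds before the next dork search")
```

Drawing from a fixed list rather than computing a fresh value each time keeps the delays bounded by the two switches
while still varying from search to search.
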
206+ ### Number of results to return
207+
208+ ` -m ` - The total max search results to return per Google dork. Each Google search request can pull back at most 100
209+ results at a time, so if you pick ` -m 500 ` , 5 separate search queries will have to be made for each Google dork search,
210+ which will increase the amount of time to complete.
211+
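To estimate how much extra work a larger ` -m ` value causes, the arithmetic above can be written as a ceiling
division. The helper name `requests_per_dork` is hypothetical; the limit of 100 results per request comes from the
text above.

```python
import math

RESULTS_PER_REQUEST = 100  # Most results a single Google search request can return.


def requests_per_dork(max_results: int) -> int:
    """Number of search requests needed to collect up to max_results URLs for one dork."""
    return math.ceil(max_results / RESULTS_PER_REQUEST)


print(requests_per_dork(500))  # 5
```

Multiplying this figure by the number of dorks in the file gives a rough upper bound on the total number of requests.
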
212+ ## Google is blocking me!
213+
214+ Performing 6500+ search requests to Google as fast as possible will simply not work. Google will rightfully detect it
215+ as a bot and block your IP for a set period of time. One solution is to use a bank of HTTP(S)/SOCKS proxies and pass
216+ them to ` pagodo ` .
217+
218+ ### Native proxy support
148219
149- To run it:
220+ Pass a comma-separated string of proxies to ` pagodo ` using the ` -p ` switch.
150221
151222``` bash
152- python3 pagodo.py -d example.com - g dorks.txt -l 50 -s -e 35.0 -j 1.1
223+ python pagodo.py -g dorks.txt -p https://myproxy:8080,socks5h://127.0.0.1:9050,socks5h://127.0.0.1:9051
153224```
154225
155- ## Google is blocking me!
226+ You could even decrease the ` -i ` and ` -x ` values because you will be leveraging different proxy IPs. The proxies passed
227+ to ` pagodo ` are selected by round robin.
228+
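Round-robin selection from a comma-separated proxy string can be sketched with `itertools.cycle`. This is an
illustration of the idea only; `proxy_rotation` is a hypothetical helper and does not reflect how `pagodo.py`
implements the ` -p ` switch internally.

```python
from itertools import cycle
from typing import Iterator


def proxy_rotation(proxy_string: str) -> Iterator[str]:
    """Yield proxies from a comma-separated string in an endless round-robin order."""
    proxies = [proxy.strip() for proxy in proxy_string.split(",") if proxy.strip()]
    if not proxies:
        raise ValueError("at least one proxy is required")
    return cycle(proxies)


rotation = proxy_rotation("https://myproxy:8080,socks5h://127.0.0.1:9050,socks5h://127.0.0.1:9051")
for dork in ["inurl:admin", "inurl:gist", "intitle:login", "inurl:wp-admin"]:
    # Each dork gets the next proxy; the fourth dork wraps back to the first proxy.
    print(f"{dork} -> {next(rotation)}")
```
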
229+ ### proxychains4 support
156230
157- If you start getting HTTP 429 errors, Google has rightfully detected you as a bot and will block your IP for a set
158- period of time. The solution is to use proxychains and a bank of proxies to round robin the lookups.
231+ Another solution is to use ` proxychains4 ` to round robin the lookups.
159232
160- Install proxychains4
233+ Install ` proxychains4 `
161234
162235``` bash
163236apt install proxychains4 -y
164237```
165238
166239Edit the ` /etc/proxychains4.conf ` configuration file to round robin the lookups through different proxy servers. In
167- the example below, 2 different dynamic socks proxies have been set up with different local listening ports
168- (9050 and 9051). Don't know how to utilize SSH and dynamic socks proxies? Do yourself a favor and pick up a copy of
169- [ Cyber Plumber's Handbook and interactive lab] ( https://gumroad.com/l/cph_book_and_lab ) to learn all about Secure Shell
170- (SSH) tunneling, port redirection, and bending traffic like a boss.
240+ the example below, 2 different dynamic socks proxies have been set up with different local listening ports (9050 and
241+ 9051).
171242
172243``` bash
173244vim /etc/proxychains4.conf
174245```
175246
176- ``` bash
247+ ``` ini
177248round_robin
178249chain_len = 1
179250proxy_dns
@@ -185,14 +256,30 @@ socks4 127.0.0.1 9050
185256socks4 127.0.0.1 9051
186257```
187258
188- Throw ` proxychains4 ` in front of the Python script and each lookup will go through a different proxy (and thus source
189- from a different IP). You could even tune down the ` -e ` delay time because you will be leveraging different proxy boxes .
259+ Throw ` proxychains4 ` in front of the ` pagodo.py ` script and each *request* lookup will go through a different proxy (and
260+ thus source from a different IP).
190261
191262``` bash
192- proxychains4 python3 pagodo.py -g ALL_dorks .txt -s -e 17.0 -l 700 -j 1.1
263+ proxychains4 python pagodo.py -g dorks/all_google_dorks .txt -o -s
193264```
194265
195- ## Conclusion
266+ Note that this may not appear natural to Google if you:
267+
268+ 1 ) Simulate "browsing" to ` google.com ` from IP #1
269+ 2 ) Make the first search query from IP #2
270+ 3 ) Simulate clicking "Next" to make the second search query from IP #3
271+ 4 ) Simulate clicking "Next" to make the third search query from IP #1
272+
273+ For that reason, using the built-in ` -p ` proxy support is preferred because, as stated in the ` yagooglesearch `
274+ documentation, the "provided proxy is used for the entire life cycle of the search to make it look more human, instead
275+ of rotating through various proxies for different portions of the search."
276+
277+ ## License
278+
279+ Distributed under the GNU General Public License v3.0. See [ LICENSE] ( ./LICENSE ) for more information.
280+
281+ ## Contact
282+
283+ [ @opsdisk ] ( https://twitter.com/opsdisk )
196284
197- Comments, suggestions, and improvements are always welcome. Be sure to follow [ @opsdisk ] ( https://twitter.com/opsdisk )
198- on Twitter for the latest updates.
285+ Project Link: [ https://github.com/opsdisk/pagodo ] ( https://github.com/opsdisk/pagodo )