Web Research Techniques

Advanced Google

Think of Google less as an oracle that answers your questions and more as the largest index of web content that you can access via a query language. Here’s a good primer on advanced Google queries. You’re probably familiar with most of it, but you’re also probably overlooking ways you could be using them to solve your everyday problems. Google becomes a much more powerful tool when you treat it as merely an index.

Examples:

  • You’re trying to find someone’s email. The query “FIRSTNAME LASTNAME “@DOMAIN.com”” is often better than email-finding tools like hunter.io.
  • Searching social media platforms via Google is often more effective than the platform’s own search. E.g.: “model nyc site:instagram.com” or “site:linkedin.com harvard 2024 founder”.
  • You’re trying to cut through to high quality information on a topic. Pick the best publication that writes on the topic and search “site:PUBLISHER.com “KEYWORD””.
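
The example queries above are just strings, so they are easy to generate in bulk. Here is a minimal sketch in Python. The helper names `build_query` and `search_url` are my own, and the person, domain, and keyword are placeholders, not real targets.

```python
from urllib.parse import urlencode


def build_query(*terms, site=None, exact=()):
    """Assemble a query string from plain terms, exact phrases, and a site: filter."""
    parts = []
    if site:
        parts.append(f"site:{site}")
    parts.extend(terms)
    # Wrap each exact phrase in double quotes so it is matched verbatim.
    parts.extend(f'"{phrase}"' for phrase in exact)
    return " ".join(parts)


def search_url(query):
    """Return a Google search URL with the query URL-encoded in the q parameter."""
    return "https://www.google.com/search?" + urlencode({"q": query})


# Placeholder inputs mirroring the email and publisher examples.
email_query = build_query("Jane Doe", exact=["@example.com"])
publisher_query = build_query(site="nytimes.com", exact=["heat pumps"])

print(email_query)
print(search_url(publisher_query))
```

Keeping query construction separate from the URL step makes it easy to loop over a list of names or keywords and paste the results into a browser.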

My favorite way to get background information on a topic is to read articles about it from the New York Times (or whatever publication is most relevant) in chronological order. You can sort results by date if you follow the instructions at the link.

Here’s a massive list of “Google dorks”, search queries that are designed to turn up a specific type of result. Most of these are trying to find websites that use a specific technology (e.g. if you have a vulnerability for a specific web technology, you use a Google dork to find websites that use that vulnerable tech). But of course this can be used for good too.

AbandonedAssets.io uses Google dorks to find websites that the owner has stopped updating and that may be purchased at low cost. The creator uses search queries like “copyright 2019” to find sites that haven’t been updated in a while. Google is incredibly powerful when you get creative.

You’ll probably want to search for content in a website’s HTML, not just its visible text. Google doesn’t let you do that, but PublicWWW does. I’ve found this useful for searching for sites that contain certain kinds of links (e.g. Discord invites).
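
Once you have a page’s HTML, pulling out a particular kind of link is a few lines of code. This is a rough sketch using only the Python standard library. It is not how PublicWWW works internally, the invite URL shapes in the regex are my assumption, and the sample HTML is made up.

```python
import re
from html.parser import HTMLParser

# Assumed invite URL shapes: discord.gg/<code> and discord.com/invite/<code>.
INVITE_RE = re.compile(
    r"https?://(?:www\.)?(?:discord\.gg|discord(?:app)?\.com/invite)/([A-Za-z0-9-]+)"
)


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_discord_invites(html):
    """Return the invite codes found in the page's links."""
    parser = LinkCollector()
    parser.feed(html)
    codes = []
    for href in parser.hrefs:
        match = INVITE_RE.match(href)
        if match:
            codes.append(match.group(1))
    return codes


# Made-up sample page with one invite link and one ordinary link.
sample = (
    '<p><a href="https://discord.gg/abc123">Join</a> '
    '<a href="https://example.com/about">About</a></p>'
)
invites = find_discord_invites(sample)
print(invites)
```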

Start thinking of Google as a terminal you can query and your powers grow stronger.

Backlink lookups are related to advanced Google searches because they also rely on web indexes. There are tons of sites (I use Ahrefs) that will give you a list of webpages that link to a given URL.

Here are some ways I’ve used this:

  • Once I found an amazing hidden gem on the internet I wished I’d found sooner. I asked myself “where would I have to have been on the internet to have found this sooner?”. I got the backlinks to the url and I found some other cool stuff I definitely never would have stumbled upon.
  • If you want to see how a piece of content is being spread, the backlinks will show you. I was curious how much traction a company’s PR post was getting, so I got the backlinks and was able to reverse engineer their promotion strategy.
  • Someone sent me a company that was raising a seed, and I wanted to gauge how well known they were. I got the backlinks to their landing page, which helped me to understand their history on the internet.
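
When a backlink list gets long, grouping it by referring domain is a quick way to see where the attention is coming from. Here is a small sketch that assumes you have exported the backlinks as a plain list of URLs. The URLs below are invented, and `referring_domains` is my own helper name, not part of any backlink tool.

```python
from collections import Counter
from urllib.parse import urlparse


def referring_domains(backlinks):
    """Count how many backlinks come from each domain, most frequent first."""
    counts = Counter()
    for url in backlinks:
        host = urlparse(url).netloc.lower()
        # Treat www.example.com and example.com as the same source.
        if host.startswith("www."):
            host = host[4:]
        counts[host] += 1
    return counts.most_common()


# Invented sample data standing in for a backlink export.
backlinks = [
    "https://news.ycombinator.com/item?id=1",
    "https://www.reddit.com/r/startups/comments/a",
    "https://reddit.com/r/entrepreneur/comments/b",
    "https://someblog.example/post",
]
domains = referring_domains(backlinks)
print(domains)
```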

Google Images and TinEye let you search for copies of a given image. This is useful when you want context on an image, such as when it was originally posted online and what it’s from. You can also use reverse image search to find higher resolution and non-watermarked versions of photos.

Social Media Tools

I’ve already shown how we can use Google to search the “walled garden” of social media. But there are lots of tools that have been built for social media, primarily to be used by marketers and community managers. If you’re trying to do something unusual with social media, it’s worth searching around for a tool.

Here are some Twitter tools I’ve used:

  • Twitter Advanced Search is pretty good. Just like Google dorks, this can be a powerful tool if you know how to use it creatively.
  • Followerwonk Bio Search lets you search for users whose bios contain keywords. I’ve used this to find people who work at a given company or are associated with a given network/fellowship/etc.
  • Followerwonk only shows you the first 50 results for free, but you could scrape this yourself with a Google dork: “site:twitter.com -inurl:status -inurl:statuses -inurl:with_replies -inurl:hashtag “@spacex””.
  • Twitual.com shows you a user’s mutuals (for low-follower accounts only).
  • Twiangulate offers various analyses - a user’s biggest followers, who users follow in common, and who follows all of a given set of users.

There are tools that help you find social media accounts (ChannelCrawler, Audiense, NinjaOutreach, SocialBlade). There are tools that help you find trending content across social media and the web (BuzzBundle, Mention, BrandMentions, BuzzSumo). There are tools for finding publications in a given niche (Google News, BuzzStream… please let me know of any others). And there are analytics tools for most marketplaces (eg. Helium10 for Amazon, Airbtics for Airbnb).

Searching the Independent Web

A lot has been said about the declining quality of Google search results, particularly how hard it is to find personal blogs and random non-commercial sites. (My favorite example of how bloated and commercial the web has become comes from this cooking site that shows us how good the web could be).

There are several search engines specializing in filling this gap. Many have quite small indexes, and some may be dead by the time you’re reading this. Personally I haven’t had great success finding cool stuff this way (my experience is that searching Hacker News works best), but I support the idea behind these projects.

So if you want genuine info on a commonly SEO’d topic like “personal finance” or “cooking”, run your term through these search engines and you might find something good:

  • Metaphor.systems gets its index from links on Twitter, Reddit, etc. and uses an LLM to query. It’s the best in this list.
  • Teclis only shows websites with no ads.
  • SearchMySite.net has its own index of user-submitted personal sites.
  • Marginalia Search focuses on “non-commercial” sites. I’m unclear exactly how it works but the results are good.
  • Wiby tends to return really weird or cool indie sites.
  • BlogSurf.io gets its index from links on Twitter, Reddit, and HN, then weights them by engagement. (Read more).

Website Research

Suppose that you want to investigate a webpage that you have absolutely no context on. In addition to backlinks, here are the places I’d check for info:

  • Whois shows you the contact information of the person who registered the domain. In the last few years it’s become very common for domain registrars to let users shield their details from the whois records. But it at least shows you which registrar they used.
  • Whoisology maintains an archive of domains’ past whois records, but their records are incomplete.
  • ViewDNS.info provides a lot of searches. For example you can get the IP of the server the website is hosted on and then find other domains hosted on that server.
  • BuiltWith.com shows what technology a website is built with. I’ve used this to reverse engineer particularly well-made sites and to gauge the technical sophistication of the team behind a site.
  • SimilarWeb.com estimates a website’s traffic and similar basic metrics.
  • SpyFu has information about which Google keywords the website is ranking for and bidding for, and how much they’re spending.
  • I’d also look for the website’s sitemap. Its location is often listed in www.domain.com/robots.txt, or you can find it by querying Google.
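
For the sitemap check, Python’s standard library can read the `Sitemap:` lines out of a robots.txt file for you. This is a minimal sketch that parses text you have already downloaded. The robots.txt body is a made-up example, and `site_maps()` needs Python 3.8 or newer.

```python
from urllib.robotparser import RobotFileParser


def sitemaps_from_robots(robots_txt):
    """Return the sitemap URLs declared in a robots.txt body (empty list if none)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap: line is present.
    return parser.site_maps() or []


# Made-up robots.txt content for illustration.
sample_robots = """\
User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""
sitemaps = sitemaps_from_robots(sample_robots)
print(sitemaps)
```

If a site declares no sitemap, the function returns an empty list, and guessing at /sitemap.xml is a reasonable next step.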

Piracy

Sometimes you need access to something but it’s paywalled. Under certain circumstances, it is ethical to pirate what you need. (Check with your local religious official or academic ethicist for moral guidance). Here are my piracy suggestions:

For bypassing paywalls like on the New York Times or Wall Street Journal, the Internet Archive and Archive.Today often work. They’re also very useful for finding old caches of content at a given url, such as when an old comment links to a url that is now dead or has missing content.
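
For the dead-link case, it helps to know that Wayback Machine snapshot URLs follow a predictable pattern. The sketch below only builds URLs and makes no network requests. The URL pattern and the availability endpoint reflect my understanding of the Internet Archive’s public interface, so treat them as assumptions and check the current docs. The dead URL and date are placeholders.

```python
from urllib.parse import urlencode


def wayback_snapshot_url(url, timestamp):
    """Build a Wayback Machine URL that should resolve to the snapshot closest to timestamp.

    timestamp is a YYYYMMDDhhmmss string or a prefix of one (assumed format).
    """
    return f"https://web.archive.org/web/{timestamp}/{url}"


def wayback_availability_url(url, timestamp):
    """Build a query URL for the availability endpoint, which replies with JSON."""
    query = urlencode({"url": url, "timestamp": timestamp})
    return "https://archive.org/wayback/available?" + query


# Placeholder dead link and target date.
dead_link = "http://example.com/old-post"
snapshot = wayback_snapshot_url(dead_link, "20150101")
lookup = wayback_availability_url(dead_link, "20150101")
print(snapshot)
print(lookup)
```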

  • Libgen.io is the best source for books. Kindle lets you email PDFs to a unique email address, so it’s easy to transfer books from your computer to your Kindle or the Kindle app.
  • Sci-hub.hk will give you any journal article you want for free.
  • I think ThePirateBay.org is still the best place to find software, movies/TV, etc. Oftentimes if you’re looking for a file that you can’t find anywhere, you can find it on the Bay (e.g. the Star Wars Christmas special or the complete archives of Playboy magazine).

To download files from the Bay, you’ll need a torrent client. I recommend BitTorrent. It is unheard of for someone to face repercussions for pirating content for personal use, but it’s bad manners to torrent on someone else’s wifi. If you torrent something that just came out, your ISP might send you an email telling you to not do that again. If it makes you feel better, you could get a VPN. For security reasons, I can’t recommend the VPN I use.