Searching Web Page Titles: intitle: and allintitle:

Each web page has a title – depending on your web browser, the title of a web page is shown as part of the tab or in the browser’s title bar. For example, in the picture below, the title of CNN.com is CNN – Breaking News, Latest News and Videos.

To search for web pages with specific words in their titles, use the intitle: and allintitle: operators. For example, to find all web pages containing the word rome in their titles, you can search for:

intitle:rome

Suppose I were only interested in traveling to Rome – a Wikipedia entry about the city doesn’t help me figure out how to travel there. I might try the allintitle: operator, which searches for web pages whose titles contain all of the given words. For example:

allintitle:rome travel

The allintitle: operator works best with a few key words – remember, you’re searching web page titles which are usually short and to the point.

Finding Related Websites – The Related: Operator

As Google indexes the Internet, it can make connections between related websites and content. Take advantage of these connections by using the related: operator.

The related: operator shows related web sites. For example, if I search for related:chase.com, I’ll get a list of banks:

related:chase.com

This tool is useful when you’re trying to find competitor services. For example, if I was looking for a job, I would be looking for job sites to search postings and add my resume. I know that indeed.com is one job board. I can find other job sites by using the related: operator:

related:indeed.com

Google Doodle: Desi Arnaz

Today’s Google Doodle celebrates Desi Arnaz, best known for playing Ricky Ricardo in the TV show I Love Lucy. Here’s how the Google page looked with the doodle:

The doodle itself:

Clicking on the doodle links you to a search for Desi Arnaz:

Clicking on the link to explore the life of Desi Arnaz brings you to a Google Arts & Culture article:

Listing Files Within A Bucket Folder – Python

Here’s a short code example in Python that iterates through the contents of a folder ( thisisafolder ) within Google Cloud Storage (GCS). Each filename can be accessed through blobi.name – in the code sample below, we print it out and test whether it ends with .json.

Remember that folders don’t actually exist on GCS, but a folder-like structure can be created by prefixing filenames with the folder name and the forward slash character ( / ).

    from google.cloud import storage

    client = storage.Client()
    bucket = client.get_bucket("example-bucket-name")
    # The trailing slash keeps similarly named prefixes (e.g. "thisisafolder2/") out of the listing
    blob_iterator = bucket.list_blobs(prefix="thisisafolder/", client=client)
    # Iterate through and print out blob filenames
    for blobi in blob_iterator:
        print(blobi.name)
        if blobi.name.endswith(".json"):
            # Do something with the blob that ends with ".json"
            pass
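
Because a "folder" is only a shared name prefix, it can help to see the prefix match in isolation. The sketch below needs no GCS access: it imitates a prefix listing over a hypothetical list of object names (the names and the helper list_names_with_prefix are made up for illustration) and shows why the trailing slash matters.

```python
def list_names_with_prefix(names, prefix):
    """Imitate a prefix listing: keep the object names that start with prefix."""
    return [name for name in names if name.startswith(prefix)]


# Hypothetical object names; a bucket stores them as flat strings.
object_names = [
    "thisisafolder/a.json",
    "thisisafolder/notes.txt",
    "thisisafolder2/b.json",
    "other/c.json",
]

# Without the trailing slash, "thisisafolder2/..." also matches.
loose = list_names_with_prefix(object_names, "thisisafolder")
# With the trailing slash, only names inside the folder match.
strict = list_names_with_prefix(object_names, "thisisafolder/")

print(loose)
print(strict)
```

The same reasoning should apply to the prefix argument passed to list_blobs, which is why ending the prefix with / is usually the safer choice.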

Finding Interesting Files – The Filetype: Operator

Sometimes, a researcher needs to find something other than a web page. News releases and raw data are often published as PDF files. Microsoft PowerPoint files (.PPTX) are often used to outline new company initiatives. Microsoft Word files (.DOCX) are shared while text is being edited, approved, or discussed.

To find these files, the filetype: operator (or its alias, the ext: operator) can be used. For example, if I need to find official releases of employment data, a possible search would be one of the below:

employment data filetype:pdf
employment data ext:pdf
Searching for employment data.

As the red boxes above show, all the results are PDF files, just as the search query requested.

The define: operator – A Replacement For The Dictionary

Google search is not just a great search engine, but also a great library of utility functions. An example of this is the define: operator.

The define: operator acts as a dictionary: it lets you ask for the definition of a word. For example, searching for the below text gives me the definition of this strange word:

define:defenestration

If you have a phrase you need to look up, feel free to throw it in as well. I wonder what this phrase means…

define:trip the light fantastic

I often use this function to look up domain-specific words, such as words used only in the legal or technology fields, and I’ve always found useful, intelligent definitions.

Limiting Your Search To A Single Site: The site: operator – Otherwise Known As My Favorite Operator

Perhaps the best-known and most widely used operator is the site: operator, which limits a search to a single site. For example, if I wanted to find all Disney-related pages on Twitter, I might search for (remember, no spaces between site: and the site you’re searching):

disney site:twitter.com

As you can see, all the results are on twitter.com.

This operator is really useful on large sites that have poor search functionality – for example, searching Javadocs or social media sites such as Reddit.
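
If you run the same kind of search often, the query can be assembled in code. This is a minimal sketch, assuming the search page takes the query in a q= URL parameter; the helpers build_query and search_url are hypothetical names for illustration. Note that the operator and its value are joined with a colon and no space.

```python
from urllib.parse import urlencode


def build_query(terms, **operators):
    """Join free-text terms with operator:value pairs (no space after the colon)."""
    parts = [terms] if terms else []
    for name, value in operators.items():
        parts.append(f"{name}:{value}")
    return " ".join(parts)


def search_url(query):
    """URL-encode the query for the search page (assumes a q= parameter)."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query("disney", site="twitter.com")
print(query)
print(search_url(query))
```

The same helper works for the other operators, for example build_query("employment data", filetype="pdf").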

Finding Old/Historical/Archived Content – The Cache Operator & Archive Services

Is your bookmark leading to an empty webpage? Did that link you found on a forum post dated 5 years ago no longer work? Perhaps you need some information from a site and it’s currently down for maintenance?

Fortunately, Google has you covered. The cache: operator shows you the given web page as Google last saw it. Using it is easy: type in cache: and then the URL you need to see. Make sure there is no space between cache: and the address.

As an example, see below:

cache:reddit.com

After you hit the search button, you’ll get something similar to this:

On some occasions, Google won’t be able to find a cached page, and you’ll see an image similar to the below:

In these cases, it’s time to pop over to archive.org and use the Wayback Machine: put the URL you want into the Wayback Machine prompt:

You’ll see options to select a year and a specific date. Click one of the dates circled in blue to see the web page as it was on that date.

The Wayback Machine is useful for seeing historical snapshots of web pages as well, and seeing how web pages change through time.
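
Snapshot addresses can also be built by hand. The sketch below assumes the commonly seen web.archive.org/web/<timestamp>/<url> address pattern and the archive.org/wayback/available lookup endpoint; to my understanding, both resolve to the capture closest to the requested date, but treat that as an assumption. The function names are hypothetical, and the code only builds the URLs without fetching them.

```python
from datetime import date
from urllib.parse import urlencode


def wayback_snapshot_url(page_url, when):
    """Build a Wayback Machine address for a capture near the given date."""
    timestamp = when.strftime("%Y%m%d")
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


def wayback_availability_url(page_url, when):
    """Build a lookup URL that reports (as JSON) the closest capture, if any."""
    params = urlencode({"url": page_url, "timestamp": when.strftime("%Y%m%d")})
    return "https://archive.org/wayback/available?" + params


target_day = date(2015, 6, 1)
print(wayback_snapshot_url("reddit.com", target_day))
print(wayback_availability_url("reddit.com", target_day))
```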

Delete Old Entities – Java Datastore

This is an ultra-simplified example of how to delete old entities from the App Engine Datastore. The first 3 lines of code retrieve the current date, then subtract 60 days from the current time (the multiplication converts days to milliseconds). DATE_PROPERTY_ON_ENTITY is the date property on the entity – when first writing the entity to the datastore, add the current date as a property. ENTITY_KIND is the entity kind we’re deleting.

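		//Imports, placed at the top of the file.
		import com.google.appengine.api.datastore.DatastoreService;
		import com.google.appengine.api.datastore.DatastoreServiceFactory;
		import com.google.appengine.api.datastore.Entity;
		import com.google.appengine.api.datastore.PreparedQuery;
		import com.google.appengine.api.datastore.Query;
		import java.util.Date;
		import java.util.Iterator;

		//Calculate 60 days ago. The L suffix forces 64-bit arithmetic;
		//as int math, the product (5,184,000,000) would overflow.
		long current_date_long = (new Date()).getTime();
		long past_date_long = current_date_long - (1000L * 60 * 60 * 24 * 60);
		Date past_date = new Date(past_date_long);
		
		DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
		//Match entities whose date property is at or before the cutoff.
		Query.Filter date_filter = new Query.FilterPredicate("DATE_PROPERTY_ON_ENTITY", Query.FilterOperator.LESS_THAN_OR_EQUAL, past_date);
		Query date_query = new Query("ENTITY_KIND").setFilter(date_filter);
		PreparedQuery date_query_results = datastore.prepare(date_query);
		
		Iterator<Entity> iterate_over_old_entities = date_query_results.asIterator();
		
		//Log and delete each matching entity, one at a time.
		while (iterate_over_old_entities.hasNext()) {
			Entity old_entity = iterate_over_old_entities.next();
			
			System.out.println("Deleting: " + old_entity.getProperties());
			
			datastore.delete(old_entity.getKey());
		}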

Note that this is a simplified example – it’s useful if you have a handful of entities that need deleting, but if you have more than a handful, you should switch to datastore cursors and page through the entities to delete.
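
As a side check on the days-to-milliseconds multiplication, it is worth looking at how large the result is. Java evaluates an expression made only of int literals in 32-bit arithmetic, so 60 days in milliseconds does not fit unless one operand is a long (for example 1000L). The Python sketch below only illustrates the arithmetic; wrap_int32 is a hypothetical helper that imitates 32-bit signed wraparound.

```python
MS_PER_DAY = 1000 * 60 * 60 * 24
INT32_MAX = 2**31 - 1


def wrap_int32(value):
    """Imitate two's-complement wraparound of a signed 32-bit integer."""
    return (value + 2**31) % 2**32 - 2**31


sixty_days_ms = MS_PER_DAY * 60
wrapped = wrap_int32(sixty_days_ms)

print(sixty_days_ms)              # 5184000000
print(sixty_days_ms > INT32_MAX)  # True
print(wrapped)                    # 889032704
print(wrapped / MS_PER_DAY)       # roughly 10.29 days instead of 60
```

With the wrapped value, the cutoff would land about 10 days in the past rather than 60, so far more entities would match the delete query than intended.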

PHP Post To PubSub

Today’s post is a rather large fragment demonstrating how to post to Google PubSub. While there are libraries to handle this, I prefer to understand the low-level process so debugging is easier.

Note that this fragment is designed to run on App Engine, as it relies on the App Identity service to pull the credentials required to publish to PubSub. You only need to set up 3 variables: $message_data, which should be a JSON-encodable object; NAMEOFGOOGLEPROJECT, which is the name of the Google project containing the PubSub topic you want to publish to; and NAMEOFPUBSUB, which is the name of that topic.

It isn’t required, but it is good practice to customize the User-Agent header below. I have it set to Publisher, but a production service should have it set to an appropriate custom name.

use google\appengine\api\app_identity\AppIdentityService;

//Build JSON object to post to Pubsub

$message_data_string = base64_encode(json_encode($message_data));

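//Attributes are a map of attribute name => attribute value
$single_message_attributes = array ("iana.org/language_tag" => "en");

$single_message = array ("attributes" => $single_message_attributes,
    "data" => $message_data_string,
);
//The publish call expects "messages" to be a list, even for a single message
$messages = array ("messages" => array ($single_message));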

//Post to Pubsub

$url = 'https://pubsub.googleapis.com/v1/projects/NAMEOFGOOGLEPROJECT/topics/NAMEOFPUBSUB:publish';

$pubsub_data = json_encode($messages);

syslog(LOG_INFO, "Pubsub Message: " . $pubsub_data);

$access_token = AppIdentityService::getAccessToken('https://www.googleapis.com/auth/pubsub');

$headers = "accept: */*\r\n" .
    "Content-Type: application/json\r\n" .
    "User-Agent: Publisher\r\n" .
    "Authorization: Bearer " . $access_token['access_token'] . "\r\n" .
    "Custom-Header-Two: custom-value-2\r\n";

$context = [
    'http' => [
        'method' => 'POST',
        'header' => $headers,
        'content' => $pubsub_data,
    ]
];
$context = stream_context_create($context);
$result = file_get_contents($url, false, $context);

syslog(LOG_INFO, "Returning from PubSub: " . $result);
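
To make the shape of the request body easier to see, here is a minimal Python sketch that builds the same kind of payload without sending it. It assumes the v1 publish format used in the URL above: a messages list in which each message carries base64-encoded data and an optional map of string attributes. The helpers build_publish_body and publish_url, and the sample project and topic names, are hypothetical.

```python
import base64
import json


def build_publish_body(message_data, attributes=None):
    """Return the JSON request body for publishing a single message."""
    # The payload is JSON-encoded first, then base64-encoded.
    raw = json.dumps(message_data).encode("utf-8")
    message = {"data": base64.b64encode(raw).decode("ascii")}
    if attributes:
        message["attributes"] = attributes
    # "messages" is a list, even when only one message is sent.
    return json.dumps({"messages": [message]})


def publish_url(project, topic):
    """Return the publish endpoint for the given project and topic names."""
    return f"https://pubsub.googleapis.com/v1/projects/{project}/topics/{topic}:publish"


body = build_publish_body({"hello": "world"}, {"iana.org/language_tag": "en"})
decoded = json.loads(body)
round_trip = json.loads(base64.b64decode(decoded["messages"][0]["data"]))

print(publish_url("example-project", "example-topic"))
print(round_trip)
```

Decoding the body again, as the last lines do, is a quick way to confirm that a subscriber would receive the original object.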