The Burden of Truth
Discussion, opinion, information, and more opinion on search technology, the net, and the world.

Wednesday, June 25, 2003


GOOGLE'S LOST INDEX FILES - YOU DECIDE

The stress levels amongst webmasters and web promotion professionals are obvious at present. The message boards and newsgroups are humming with speculation, with a hint of trepidation.

Sometimes a Google employee steps into the fray and offers some calming insights and advice. Apparently not this time though - at least not thus far.

However, at times like this it is sometimes wise to look at the overall alternatives from a slightly higher perspective. I would suggest that there are THREE possibilities to explain the current search result flux (see footnote):

1) The 'Twilight Zone' proposition is correct, or on the correct lines. For anyone not familiar, this is the suggestion that Google is currently unable to manage all of the most recent link data, and shuffles it in and out of the index, perhaps using FreshBot. There are plenty of supporting arguments for this.

2) This is part of the expected and planned technology changeover cycle, in which case, the data centers will eventually align, and as the Google employee suggests, all the missing data will then be added. Normal looking search returns will eventually resume.

3) This is planned, and calculated, and permanent. No... I don't really want to think that either, because of the implications... these being that maybe the Google employee himself was misled when he was told that there was no algorithmic tweak or filter affecting index pages (which he subsequently passed on to webmasters).

This would mean that in the bigger context the so-called Google 'social contract' has been unilaterally ripped up: Following the Google 'philosophy' of content-content-content and seeking relevant links will ultimately have been a pathway to oblivion for a lot of people (certainly on the most commercial terms). The rather sinister logic is that the main target terms have been purposefully selected by Google for this attention because they are of most commercial value. If the search results on these terms are relatively poor, this will drive up Adwords clicks and ultimately revenue for the company. This conspiracy theory is far from new.


TAKE YOUR PICK
I'm personally sticking with the first two for now. Why? It's in the dates.

If this were a global tweak or filter of some sort, those sites whose link structures have been there for years would be equally affected by the flux. They aren't. This fact actually reinforces the 'trusted base' idea, with an outer twilight region (the TZ theory).

Also, for the second possibility, the Google employee did present a sort of roadmap to support it. His postings didn't preclude the first proposition by the way, but the roadmap did suggest that missing data would be added close to the end. But when is the end? Is it really in July as some people now suggest? Obviously I don't know the answer.

For the third, that type of theory has frequently raised its head before, but frankly I doubt it. The problem is, the longer things go unchecked, the more likely it will seem to many people, and the odds will ultimately change.

It's your choice really... take your pick.


[FOOTNOTE: The main fluctuation is essentially that for many web sites the main index page doesn't feature on the most popular industry term. Sub-pages are available, but tend to attract much lower ranking.]



Friday, June 20, 2003


GOOGLE'S STORMY WATERS

Life is always hectic when Google is re-indexing, but never previously as traumatic as with this cycle. Frankly, the index is all over the place.

Two issues have emerged:

a) Google has forgotten the importance of a site's index page. Hundreds of thousands of sites have had their main index page ditched or relegated on their main search terms. Sub-pages from the same sites still appear, but obviously with much lower ranking.

Webmasters are faced with their contact page or links pages (for example) being presented as the entry page for visitors. Ooops!

b) The index is still not up to date. It seems to have moved along slightly, but recent links are prone to come and go, almost within the hour.

The distinction between 'fresh' (news) and core index has become blurred. Core index fodder is being treated like temporary fresh fodder. Many new sites are still nowhere to be seen, or appear only fleetingly.


And no, it isn't just the little guy who is currently suffering. A major news group aggregation firm, EasyNews, for example, was recently forced to offer a reward to anyone who could help them out of the mire. Incredibly, they have been ditched on the search term 'Easynews'. They are far from being alone, even though other factors seem to apply in their case.


Trying to establish some pattern is proving to be extremely difficult, although some have tried. Indeed, there may not even be a pattern. Some of these features may indeed be unforeseen by Google itself, as it continues to wrestle with its new software.

With the ins and outs constantly occurring, many webmasters are clinging to the hope that all the missing links/etc will be added as the final ingredient to the pot when the new index is distributed across all 9 data centers.

Only time will tell whether that will happen, but certainly, the waters are very rough at present. One would imagine that Google is aware that the current situation isn't exactly healthy for them either - or maybe not.

Sunday, June 15, 2003

WEB HOSTING NIGHTMARES

One of the worst occurrences for any webmaster or search engine optimizer is of course web site unavailability. Yes, it happens. Unfortunately, sometimes all too often, and sometimes for far too long. Not all web hosts are equal - FAR from it.

For my pains I have a significant number of hosts. Some are excellent, some are fine and some are... let's just say there is plenty of room for improvement.

The first group of hosts provide almost 100% availability, decent speeds, and are there when I need them. These are probably the big 3 needs in web hosting terms, and a host that fails to make the grade for any one of them is likely to be a real headache.

But how do you go about ensuring that you are not stuck with a bad host? How do you know they are bad unless you actually try them? Well fortunately, there are certain telltale signs.

Before I list a few though - a golden rule: NEVER let your host acquire and administer your domain name for you. That's a definite no-no. If you hit problems downstream the last thing you will need is trouble getting out of there and pointing your DNS at another host's servers. Allowing your host to control your domain name is like blocking your own escape route. Don't even think about it.

OK.... what about that list? Consider the following:

1) Does the host have proper contact details on their web site? This MUST include a telephone number. Apart from the fact that dodgy hosts often don't give their number, you won't want to sit on your hands if your site is down and you get no response via email.

2) Check their reputation. Take a look at their own forum, if they have one, and somewhere like www.webhostingtalk.com. Do the homework.

3) Check out their infrastructure. If they are decent they are likely to have this information on their site. For an important site you are unlikely to want a reseller of a reseller of a reseller. Know something about where and how your site will physically be hosted.

4) Send them an email asking a question. How quickly did they respond? How professional was their response?

These four are an absolute minimum, and go side by side with cost and the package on offer. Ducking the responsibility of checking these things could well come back to haunt... your site could be down when Googlebot is crawling, or when a customer is buying. It's not a great feeling to be sat upon a lengthy outage with matters outside your control.


GOOGLE DEVELOPMENTS

No sign of a traditional deepcrawl, but hints are emerging from the Plex that we may see a re-index based upon deeper than usual freshbot crawls in the next week or two. This will certainly help matters, at least to a degree, even though it may still be lacking the usual thoroughness in mapping the current web.

With all 9 data centers also largely aligned at present, the signs are good for at least some update activity in the very near future. We can only hope that this resolves most of the outstanding issues.

Thursday, June 12, 2003


IT'S IN THE DNA

I am sometimes asked why I rate Google so highly relative to most others. The answer is DNA (I'm not the first to use the term in this context by the way).

What I mean by that is simply that Google is a search technology company. It created the Google engine. It has fostered it, improved it, and expanded it over the years. It is obviously committed to it and is actually interested in search technology itself. As a result we see a product which is focused upon relevancy of returns - which is actually exactly what searchers want.

But what of the other main players?

Firstly, if you look at the main portals, they are not actually owned by search technology firms. MSN and AOL, for example, are simply portals, using technology provided by others to deliver search functionality and other services. Yahoo owns its own directory, and has relatively recently bought the Inktomi search engine. But you still get the feeling that they see search as a side product rather than the core offering of the company (look at the way they have relegated the directory over the last couple of years).

All these firms offer search, but none have a core interest in driving search technology and search quality forward.

What about Alta Vista and AllTheWeb (FAST)? Certainly, these WERE search technology driven, but they have recently been bought out by Overture. Of course as a PPC (Pay Per Click) firm, Overture is very much an ADVERTISING company. Those are its roots, and that is in its DNA. I don't really think it is going to suddenly acquire a cultural or academic interest in search technology.

Those two established engines will most likely have been procured to provide platforms for Overture PPC, or a platform to make PPC/Search bids to major portals. Sadly, I foresee a bleak future for both in terms of quality.

Of course there are other smaller search engines out there, some utilizing leading edge search technology and producing excellent results. The problems they face of course are visibility, scalability and market share. Some of them do have the right DNA though, and hopefully will eventually emerge to challenge Google in the quality stakes. Time, as ever, will tell.

However, at present, none of the leading players really has the focus or DNA to seriously challenge Google in terms of product quality. That isn't a healthy situation at all.


GOOGLEGUY COMMUNICATIONS

I've just been reading the Q&A by GoogleGuy at WMW. What always strikes me about it is not the information provided, which of course has to be limited, but that Google allows it to happen. This open approach to stakeholder communication is actually very refreshing. It's a pity other major players aren't equally enlightened. It clearly benefits all parties.

Wednesday, June 11, 2003


MSN YELLOW PAGES

What is it with MSN? You'd think that with the massive resources of Microsoft behind them they might produce a search experience that was at least tenable.

I have just used MSN for the first time in a year or so. Even knowing what I do about Microsoft, and their well documented approach to business matters, I was surprised.

Basically it's just layer upon layer of adverts. I searched on 'flights' and was presented with:
- Featured Sites
- Sponsored Sites
- Web Directory Sites

I guess most people can work out that the top two sets of sites are adverts (PPC/etc). The third though is generally a set of adverts as well: the so-called 'Directory' is actually Looksmart, who operate a rather strange PPC system. A LOT more on Looksmart will be logged here soon (and it ain't pretty!).

Essentially then, MSN present page after page of adverts for many searches. But why?

OK, they earn short term returns on the clicks... but do they really need it when it comes at the cost of driving people to Google? Who on earth is going to return to MSN having experienced the relatively clean search at Google?

I can only assume that they are feeding upon the naive - the non-net savvy. In other words, those new to the internet who don't actually realize yet that there are much better options out there. Not exactly a policy with much longevity then.

Perhaps their fundamental philosophy is that if they can make a good return on a bad product, then why bother to fix it? If you have an answer to that, send it to William Gates.


THE GOOGLE CONSPIRACY DEBATE
On my travels yesterday I came across a site explaining the alleged scalability issues in a little more depth. Try www.google-watch.org/broken.html

These guys obviously aren't big Google fans!

Monday, June 09, 2003


THE GREAT GOOGLE CONSPIRACY THEORY

Conspiracy theories usually emerge when there is a lack of confirmed data. It is hardly surprising, therefore, that the 'Broken Google' scenario has started to throw them up. The two most recent:

a) The Adwords Plot
This works on the basis that by DELIBERATELY presenting stale results, especially for new technology, those launching new products or services will be forced to buy Adwords (Google's PPC advertising system) as an alternative to benefiting from the normal relevancy rankings. Stale results therefore equate to bigger profits for Google.

b) The Scalability Issue
This one runs along the lines that each URL (page address) has a 4 byte stored identifier. Four bytes can only distinguish 2^32 values, or roughly 4.3 billion pages, a figure which Google is on the verge of reaching. Consequently, they need to upgrade to 5 bytes, which requires massive architectural database changes (allegedly). All hands to the pump! Note that some sources indicate that Google already has up to 40,000 Linux servers and are still buying them by the bucketful.
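The arithmetic behind the 4 byte claim is easy enough to check. The short sketch below simply counts how many distinct values fit into an unsigned identifier of a given size. It assumes nothing about how Google actually stores its identifiers, and the function name is purely illustrative.

```python
def max_distinct_ids(num_bytes: int) -> int:
    """Number of distinct values an unsigned identifier of num_bytes can hold."""
    return 2 ** (8 * num_bytes)


four_byte = max_distinct_ids(4)
five_byte = max_distinct_ids(5)

print(f"4 bytes: {four_byte:,} identifiers")
print(f"5 bytes: {five_byte:,} identifiers")
print(f"5 bytes holds {five_byte // four_byte} times as many as 4 bytes")
```

So one extra byte would multiply the available identifiers by 256, which is why the theory treats the move to 5 bytes as a long term fix rather than a stopgap.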

Stories like these will undoubtedly continue to proliferate whilst there is no clear timetable for update, and whilst basic questions remain unanswered: Why does much of the core database still contain Jan/Feb data? If there is no problem, why no update of the data now that the new software has been distributed? Why no sign of the deep Googlebot crawler? When will new data be introduced?

Silence of course can have different effects upon different people, and be interpreted in entirely different ways. Some may take it as a sign of confidence, or control, or unconcern. However, with speculation mounting, others will point out that if there really are no issues, it is not difficult for a company like Google to find a way to indicate this, or to offer an explanation... or at least to say something.

The certainty of course is that we have never witnessed anything quite like this since Google emerged. We therefore have little idea how they would actually manage a trauma, if indeed there ever was one. Nor have we really too much idea what is really going on at Mountain View, save the points made in my 5th June post.

It may indeed be time now for Google to comment and offer some transparency... but I'm not holding my breath! Alternatively, they could simply update that darned database!


SIDE NOTE: I know - this isn't a Google only log! Other search engine and online marketing issues will indeed receive full coverage and commentary shortly. For the last few days though, Google has certainly been worth watching, and reporting on. Non-Googlers please bear with it!

Saturday, June 07, 2003


IF GOOGLE SNEEZES...

I vaguely recall reading the phrase "If America sneezes Britain catches a cold", I believe pertaining to the economic ties between the two countries. I can't recall where this was, or even how accurate the quotation is. However, for some reason it springs to mind whenever I think of the current Google shenanigans.

As posted on the 5th, it is pretty clear that the Google engine is not running as smoothly as it normally does. The core database is quite a few months out of date, causing the engine to splutter on some searches. There is still no clear indication of when the requisite oil change will take place. It could be weeks, or even months, despite hints to the contrary from certain Google sources.

If this isn't wonderful news for Google users, such as myself, spare a thought for those whose livelihoods depend upon regular Google updates and a crisp new database: the search engine placement and promotion professionals (SEOs). These guys, who optimize web sites for others (simplifying it), are currently left high and dry, many no doubt having presented future projections to their clients before the current impasse arose.

This partially explains the frenzy around the nine Google data centers. A number of optimizers, and of course many web site owners, trawl over the returns from each center on a daily basis, no doubt hoping to find SOME evidence that changes are afoot and new sites are finally being properly introduced. There have been one or two minor shuffles over the last few days, but sadly, nothing to alleviate the pain for the SEO industry and those with sparkly new sites.

Webmaster message boards have been overflowing for weeks, many with desperate tales, and sometimes with a hint of barely suppressed anger. Certainly, it is an uncertain future, in an uncertain industry, or as one webmaster recently put it: "We are competing using a media which itself is competing. This is a recipe for instability".

Against the backdrop of Google domination of the search industry, this perhaps translates to: "If Google sneezes, webmasters catch a cold". The concern of some is that Google itself has caught a cold, or even more seriously, that regular searchers are starting to show symptoms.

Thursday, June 05, 2003


IS GOOGLE BROKEN?

If you follow Google closely, you may well have been wondering what has been going on over the last few weeks - many other people certainly have!

Some pretty bizarre results have been returned on certain searches. On others, unexpected surprises lurk (for example, a search on 'search engine' today produces a top return for Alta Vista, well ahead of Google itself). Searchers are also widely reporting an increase of dead or expired sites in the top 10.

So what IS going on at the GooglePlex at present?


Google isn't really broken, although many would argue it is 'wobbling' due to self inflicted wounds. Nor is it terminally ill. Here are the facts as best as we could establish:

a) Google introduced a new algorithm (a new software release) some 4 weeks or so ago

b) It began to roll out to its 9 data centers. The plan was to use the January/February index as a base, and then presumably to quickly upgrade to use the latest April crawl data. Note that we were on March data at the time, so they were actually taking a step backwards in time.

For anyone wondering why they chose this approach: wouldn't you try new software on old tried and tested data?

c) As is the nature of software there may well have been glitches (the details don't really matter if there indeed were any), but whatever happened, it took over 2 weeks to get everything relatively settled and all nine centers more or less returning the same results.

d) The software is now dispersed, but the April crawl data itself is already well out of date.

e) Google SEEMS to have decided to send out its Freshbot crawler instead of introducing this data. This crawler traditionally crawls only high profile or frequently changing sites, and does NOT perform a full crawl of the web, which is normally performed on a monthly cycle (approx). Freshbot currently seems to be crawling a little deeper than usual, perhaps trying to cover some missing ground, but again is far from being a full crawl.

f) For May, the Google update as we traditionally know it has therefore been skipped. Much of the core (non-Freshbot) index actually reflects January/February, yet we will be in the middle of summer soon!


So what does this mean?

- Well firstly, there are plenty of major sites missing or not reflected as prominently as they should be in the current Google returns.

- There are also more dead sites in the index (sites disappearing over the last month or two will still appear in Google regardless).

- Finally, sites that should normally have climbed the Google tree, perhaps due to changing events or other factors which might have recently increased their importance, have remained grounded. There are reports of major new technology sites not appearing for example.


No doubt Google are working hard to resolve all these issues. However, this perceived glitch in the normally smooth running of the search king will not have done them any favors. At the very least, by some they are now seen as fallible and prone to the same sort of problems as every other technology firm, especially as the above drama has been played out in full public view. It may also have offered encouragement to a number of their competitors.


Google is my default search provider as well, so I admit some bias, but here's hoping they sort this out quickly and introduce a full set of recent data very soon. I suspect that 'wobble' will probably turn out to be the most accurate description of events, and tales of Google demise a little premature.


SPAMMING THE INNOCENT

Like everyone else I'm dogged by spam. However, possibly the most amusing for me, because I am involved in the search technology sector, are those offering services such as: "guaranteed top 10 ranking on Google"; "submit your site to 5000 search engines"; et al.

One assumes some people fall for this, or the endless pap would not continue to flow. It does make you wonder though, whether they stop to think at all before they buy (and consequently, would a spam titled "Give Us Your Money" be equally effective?). Surely one simple thought would suffice: "If they can deliver the top 10 in Google cheaply and easily, why do they need to spam?". Or maybe they actually think they are the only recipients of the email!

Of course ANYONE can deliver a top 10 placement on a crawler based search engine:
- create a random character string, which is unlikely to be anywhere else on any site on the web
- insert the string onto your site
- get a link to your site from an established portal or directory.

In time you will be picked up and of course obtain the ranking on that invented search string... because no-one else is interested in it, no-one is searching on it, there are no competitors for it, and it is totally worthless.
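For anyone who wants to see just how trivial the first step is, here is a minimal sketch that produces such a string. The function name, the default length and the choice of lowercase letters plus digits are all my own assumptions for illustration. It uses Python's standard secrets module.

```python
import secrets
import string


def make_test_string(length: int = 16) -> str:
    """Build a random string that is very unlikely to exist on any other site."""
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


token = make_test_string()
print(token)
```

With 36 possible characters in each of 16 positions, the chance of anyone else having published the same string is negligible, which is exactly why ranking for it proves nothing.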

The value of a top 10 ranking is clearly proportional to the value of the market the particular term represents, and the 'normal' conversion rate (eg: how many people buy and how much they spend) for that term.
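As a rough back-of-envelope sketch of that proportionality: every figure below is invented purely for illustration, and the click rate parameter is my own added assumption (the share of searchers who actually click the listing), not something measured.

```python
def monthly_ranking_value(searches, click_rate, conversion_rate, average_order):
    """Rough monthly revenue attributable to a ranking on a single term."""
    visitors = searches * click_rate      # searchers who click through
    buyers = visitors * conversion_rate   # visitors who go on to buy
    return buyers * average_order


# All figures are hypothetical.
popular_term = monthly_ranking_value(10_000, 0.05, 0.02, 50.0)
invented_string = monthly_ranking_value(0, 0.05, 0.02, 50.0)

print(f"Popular term:    {popular_term:.2f}")
print(f"Invented string: {invented_string:.2f}")
```

A term nobody searches on is worth nothing however high the ranking, which is the whole point.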

One assumes many of the above spammers play around the edges of this reality, hoping their clients never grasp the real facts of web marketing. That is, assuming they actually understand the facts themselves.

Spamming and short changing the naive is something I will return to in due course.

Tuesday, June 03, 2003


RANDOM TEXT GENERATION
(Or: Who or WHAT wrote that article?)


Scratch the surface and sometimes you find a whole science lurking underneath.

That's precisely what happened when I began to think about automatically creating content for a web site (not that I'm lazy of course!). There is indeed a whole field of study out there.

First things first though: consider the following short texts:


1. The satellite beyond the legend
The moldy phantasm hurled some bulb at a splendor. I watched in horror as the fire behind the recording underhandedly explored the inferiority! A history was terrible. The polygon toward a tome laughs like a man insane, because a cosmic memory consumed the ooze toward the sanity. A source peeked at some bulb over a library, or a wedge toward another echo accidentally summoned a lethal Necronomicon. A blasphemous coin ignored a legend out of a lover. Some tattered figure was amphibian. The township engulfed a lethal egg. When a raspy creature is frozen, the exceedingly strange polygon ignored a terror about an automobile. When an unspeakable Elder Sign is bizarre, the wedge related to another mushroom seekd some darkness. A wedge around a blackness sows the seeds of its own damnation, and the burden peeked at the vaporized crane. Most people believe that a secret unleashed its power upon the febrile torch, but the inconceivable coffin is much more modern. When a building is indescribable, the anomaly related to another township satiated a horror. Most people believe that another scream living inside a nightmare evicerated a monolith, but the dreaded tomb is much more proverbial. When the thing hesitates, the rock behind a creature feels the squid. When a bohemian knowledge is bizarre, an engine beyond the delicacy reached an understanding with the fraction. The spirit was burly. Indeed, the unearthly history of the hole was secretly cryptic.

2. An almost putrid polygon
A temporal scream burned a proverbial blood clot. If a living case cleaned the manuscript, then the magnificent pyramid sleeps. Most people believe that an anomaly toward a tome shared its power with an ancient aversion, but the imaginative delicacy is much more temporal. Indeed, the horrible squid of a nearest memory was grotesque. At long last, the molten myth of the unbearable, feverishly non-euclidian statue was revealed! It was a secret living inside a blood clot, but now I had no choice but to accept the fact that an abstraction was indeed incinerated as well as usually typical! Remembering the somewhat ancient statue of the war, I prostated myself before the lantern of the the case that stood before me. A horror feverishly fainted at the very thought of the war inside the shadow. Most people believe that a splendor inexorably played horrible games with a nation beyond the vault, but the ostensibly green inferiority is much more unwittingly hideous. An unfamiliar blackness explianed a modern burden. Like a single-handledly dirt-encrusted the greedily human speech they befriended automobile, some anotomical, but others ostensibly or hardly bestowed great honor upon a torch. Like a usually bohemian the ocean they ostensibly competed with vista, some paternal, but others secretly or usually ostensibly bumped accidentally into a hole. Indeed, the orbiting ghoul of an inexorably unstable submarine was obscure. It was some demon, but now I had no choice but to accept the fact that a shocking myth was indeed paternal as well as carelessly unfathomed! A bizarre burglar shared its power with an abstraction. If the wisely terrible burglar hesitantly brainwashed the thing, then some burden prays. Remembering the thoroughly false note of an orbiting bulb, I prostated myself before the nation of the the pyramid that stood before me. The torch cleaned a submarine around a tomb.


Rambling rubbish perhaps, but both were created automatically, online, before my eyes. Try it for yourself - from the following web site you can create the same sort of stuff at the click of your mouse: http://www.darkicon.com/Library/randsent.htm

I think it's the content rather than the mechanism itself that is slightly disturbing!

Taking this further we begin of course to consider HOW it is done. Programming, rule bases and knowledge bases presumably. However, programmers might find the following to be of interest: http://www.perlmonks.org/index.pl?node_id=94856

This is designed to "generate credible sentences from a source text". It probably works differently to the above, but is interesting enough anyway. For more information still, including downloads: http://www-laog.obs.ujf-grenoble.fr/~lachaume/rant.html
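One common way to generate plausible sentences from a source text is a word-level Markov chain: record which words follow which, then walk those records at random. I have not confirmed that either of the tools linked above works this way, so treat the following as a sketch of the general idea only. The function names and the sample text are my own inventions.

```python
import random
from collections import defaultdict


def build_chain(text: str, order: int = 1) -> dict:
    """Map each run of `order` words to the words seen following it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain


def generate(chain: dict, length: int = 12, seed=None) -> str:
    """Walk the chain from a random starting key, up to `length` words."""
    rng = random.Random(seed)
    key = rng.choice(list(chain.keys()))
    output = list(key)
    for _ in range(length - len(key)):
        followers = chain.get(key)
        if not followers:
            break  # dead end: this run of words never had a successor
        output.append(rng.choice(followers))
        key = tuple(output[-len(key):])
    return " ".join(output)


# Hypothetical source text, made up for this example.
source = ("the wedge ignored a legend and the wedge summoned a horror "
          "and the horror ignored a polygon and the polygon summoned a legend")
chain = build_chain(source, order=1)
sentence = generate(chain, length=12, seed=1)
print(sentence)
```

Every adjacent pair of words in the output appeared somewhere in the source, which is what gives this kind of text its locally sensible but globally daft flavour. Feed it a longer source and raise the order to 2 or 3 and the results become noticeably more credible.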

Is this a man v machine issue? Well... no... not really. Certainly Bill Shakespeare can rest easily in his grave for the time being.

The first post. 17:45 GMT.
