Sunday, May 31, 2009

Blogging Platform Survey Results

ProBlogger recently published the results of a new online poll about blogging platforms. The results showed differences between the years 2006, 2007, and 2008. The biggest problem with this report is that ProBlogger does not tell us the number of respondents each year, so we have no means to quantify a margin of error for this poll. Some of the user comments are also interesting:

From PraShawn - "I am amazed to see how many bloggers are using blogger.com. Thanks for sharing." - OK, but how many? ProBlogger doesn't say.

From Twittipscenter - "Well, I for one voted for Wordpress......." - OK, well, now we know at least one person responded to the poll.

Darren Rowse does a great job with ProBlogger.net and it's a good blog to follow. I want to see better analysis for online surveys and polls. If you take the time to make a poll and communicate the results, please publish all the pertinent information. At a minimum, the number of respondents and the groups surveyed are important to the audience.

Wednesday, May 27, 2009

The Secret of Correlating Blog Posts with Sales!

Is it possible to determine your brand sales by monitoring social media? The answer is yes, but how precisely is yet to be determined. To help answer this question we looked at brand beer sales data from Convenience Store Decisions and blog post data by brand as indicated on Technorati.com and IceRocket.com. The keyword searches and data are shown as follows:




Using linear regression methods it almost seems like child's play at this point to show the correlation between sales data and blog posts. Even blocking for the possible effect of using the two search engines proved unnecessary; the differences between the two search engines were not important with this small set of data. One thing that was important in fitting the model was to transform the sales data to its natural logarithm. In a previous post I discussed how, as an item increases in popularity in the blogs, an exponential effect is seen in its postings. It appears this may be the case with this data as well.

The model that is developed is Ln(Case Sales) = 2.812+0.00044*BlogHits - 0.00138*Block.
The blocking variable is 0 for Technorati and 1 for IceRocket. The data and regression equation look as follows. There is an upward trend in sales as blog posts increase.
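Because the model is fitted on the log scale, predictions have to be back-transformed with the exponential to get case sales. The following is a minimal sketch of that step, not the original analysis. The coefficients are copied from the equation above; the function name and the blog-hit values are made up for illustration, and the units of case sales are whatever the source data used.

```python
import math

# Coefficients copied from the fitted model in the post.
INTERCEPT = 2.812
BLOG_HITS_COEF = 0.00044
BLOCK_COEF = -0.00138  # block: 0 = Technorati, 1 = IceRocket


def predicted_case_sales(blog_hits, block=0):
    """Back-transform Ln(Case Sales) to the original sales scale."""
    log_sales = INTERCEPT + BLOG_HITS_COEF * blog_hits + BLOCK_COEF * block
    return math.exp(log_sales)


# Hypothetical blog-hit counts, chosen only for illustration.
for hits in (0, 1000, 2000):
    print(hits, round(predicted_case_sales(hits, block=0), 2))
```

One way to read the slope: on the original scale, every additional 1000 blog hits multiplies predicted case sales by exp(0.44), or roughly 1.55, which is the exponential effect the log transform is meant to capture.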



Here is the ANOVA table for the regression analysis. The correlation is significant with a p-value of 0.0005. The R-Square seems to indicate the model only explains about 58% of the variance in the data. Not the best model, but what is important is that the correlation exists.


So what can we infer from this analysis?

1) Blog posts about certain beer brands are correlated with sales of those brands.
2) We cannot infer any type of cause and effect relationship.

Blog posts don’t cause sales. Sales can’t cause the blog posts either. The lurking (confounding) variable we are really trying to quantify here is the relative consumer sentiment about a brand. High consumer sentiment leads to both more sales and more blog posts about a particular product. Blog posts could in fact be an indicator of consumer sentiment, and this is the value of the analysis.

There are some pitfalls in this analysis. First, we can’t say for certain whether the blog traffic is positive or negative. Is Natural Light receiving more blog traffic because of negative consumer sentiment rather than positive? Perhaps there is another factor involved. Second, it may be wrong to extend this type of analysis to certain other types of consumer products. Products that don’t make it past a noise level in the blogosphere probably cannot be indexed in this fashion. Third, what about the effect of time? If sentiment changes over time, how can we be assured this will be reflected in the blog posts just by the frequency of keywords? Trending of keywords can help with this.

There is nothing new here. It’s already been established that we can easily trend Technorati, Twitter, IceRocket and other social media posts for keywords and topical interest areas. The question is, has anyone made an attempt to model this interest and correlate it directly with sales of their products?

My vision is that this type of analysis can be fine-tuned to produce a consumer blog index for various types of products, based not on voting for the popularity of a product but on chatter. If this information correlates well with sales, then it is of significant interest to businesses promoting products. Are producers interested in this data, and are consumers interested in seeing this type of data?

Can this method be substituted for certain conjoint analysis? If I am producing cameras and want to know the distribution of colors to produce, can I Google it? Or use IceRocket? Would a search with a logical combination of words representing colors and cameras yield results? The answers could be obtained in hours and save thousands of dollars and time on market research.
I would love to hear comments from those of you already engaged in this analysis.

Sunday, May 24, 2009

Joe DiMaggio - Thoughts on the Streak!

What are the odds of Joe DiMaggio's 56-game hitting streak?

This week I received my Spring 2009 copy of CHANCE magazine from the American Statistical Association. Don M. Chance wrote an excellent article evaluating hitting streaks in general, but also looking at the factors behind why Joe DiMaggio may have accomplished such a feat.

Of the top 50 all-time hitters, Chance identified Ty Cobb, Ed Delahanty and Willie Keeler as the three most likely to accomplish a 56-game hitting streak. Ironically, none of these hitters did it. In fact, the author points out that of those hitters who achieved a streak of 30 games or more (including DiMaggio), none were in the top 100 hitters of all time.

Earlier this year in the NY Times, Samuel Arbesman and Steven Strogatz presented the results of a simulation study involving Monte Carlo style analysis of 10,000 baseball universes. The most frequent record streak was 51 games. The median record streak was 53 games. Two thirds of the time the record streak was found to be between 50 and 64 games. So Joe DiMaggio's streak was to be expected in the normal course of baseball history. It was not an unusual event.
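To give a feel for what a Monte Carlo streak simulation does, here is a much-simplified sketch. It is not the Arbesman and Strogatz study, which simulated the whole history of baseball. It follows a single hypothetical player through many simulated seasons, assuming a fixed chance of getting at least one hit in each game and independence between games. The 0.80 per-game probability, the 154-game season and the function names are all assumptions chosen for illustration.

```python
import random


def longest_streak(hit_in_game):
    """Length of the longest run of True values in a sequence."""
    best = current = 0
    for got_hit in hit_in_game:
        current = current + 1 if got_hit else 0
        best = max(best, current)
    return best


def simulate_longest_streaks(p_hit_game, games, trials, seed=0):
    """Simulate `trials` seasons of independent games.

    Returns the longest hitting streak found in each simulated season.
    """
    rng = random.Random(seed)
    return [
        longest_streak(rng.random() < p_hit_game for _ in range(games))
        for _ in range(trials)
    ]


# Assumed inputs: 0.80 chance of at least one hit per game, 154-game season.
streaks = simulate_longest_streaks(p_hit_game=0.80, games=154, trials=2000)
print("median longest streak:", sorted(streaks)[len(streaks) // 2])
print("share of seasons with a 56+ game streak:",
      sum(s >= 56 for s in streaks) / len(streaks))
```

Repeating this over every player and every season, and recording the longest streak in each simulated history, is the general idea behind the "baseball universes" approach.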

The question that cannot be answered yet: why did Joe DiMaggio accomplish this feat instead of one of the top hitters of all time? According to Don Chance, Ty Cobb had a 1-in-204 chance of accomplishing this streak. He was about four times more likely to do it than DiMaggio, who has been estimated to have had only a 1-in-826 chance in his career.

The factors that the study cannot address are the intangibles, such as the play of the Yankees at the time of the streak. Teams that do well encourage success among all the players and create situations where perhaps a streak can occur. If the team is hot, the players are hot. Most of these analyses assume independence of each batting opportunity. However, if a player is in a streak, is his hitting actually independent? Can the likelihood of a hit be greater when the player is hot or in the zone than at other times? What about pitching? How does the average opponent's pitching play into the probability of a streak?

In conclusion, we just don't know why it was Joltin' Joe who set the record. Don Chance, Samuel Arbesman and Steven Strogatz have done some excellent statistical analysis of streaks and should be commended for the information. Joe DiMaggio was an excellent personality remembered for his popularity on the field and off the field.

Saturday, May 09, 2009

Social News Modeled like Bacterial Growth

It's a well-known phenomenon in biology that cells reproduce by cellular fission, a process where one cell becomes two, two become four, and so forth. Growth proceeds exponentially until the environment can no longer support it or deaths begin to occur. Ultimately the growth slows until it reaches a stationary phase, and finally a death phase occurs when deaths exceed new births.



Modeling this growth process can be accomplished by use of the modified Gompertz equation:



where:

y = LN(N/No) - N is the population count at time t, No is the starting population.
mu(max) - a rate constant indicating the maximum slope of the growth curve.
e - the mathematical constant approximately equal to 2.718
A - the asymptote of y, a constant representing the (peak value - the starting value) on the log scale

This modified Gompertz equation was published in June 1990 by M.H. Zwietering, et al. in Applied and Environmental Microbiology. The article was entitled "Modeling of the Bacterial Growth Curve".
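For readers who want to experiment, here is a small sketch of the modified Gompertz curve in the form given by Zwietering et al.: y = A * exp(-exp(mu * e / A * (lag - t) + 1)). Besides the terms listed above, the published equation includes a lag time (the point where the tangent at maximum slope crosses the starting level), which is called `lag` here. The value of A is computed from the swine flu peak and baseline counts quoted in this post, and the growth rate is back-calculated from the 16.5-hour doubling time. The lag value of 20 days is purely hypothetical.

```python
import math


def modified_gompertz(t, A, mu_max, lag):
    """Modified Gompertz curve: y(t) = A * exp(-exp(mu_max * e / A * (lag - t) + 1))."""
    return A * math.exp(-math.exp(mu_max * math.e / A * (lag - t) + 1.0))


# A from the counts quoted in the post: peak of 10884 daily posts over a
# baseline of about 4.5 (midpoint of "between four and five").
A = math.log(10884 / 4.5)
# Growth rate per day implied by a 16.5-hour doubling time.
mu_max = math.log(2) / (16.5 / 24)
# Hypothetical lag, in days since the start of the series.
lag = 20.0

for day in (15, 20, 24, 28, 40):
    print(day, round(modified_gompertz(day, A, mu_max, lag), 3))
```

The curve stays near zero before the lag, climbs with a maximum slope of mu_max, and levels off at A, which matches the lag, exponential and stationary phases described above. It does not model the death phase.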

OK so how does this apply to social media and Swine Flu or any other topic for that matter?

According to data obtained from the Technorati.com website, blog posts matching the search term "Swine Flu" took off beginning about April 22nd and reached a peak of 10,884 daily posts on April 28th. Prior to this news event, the average number of daily posts on the topic of swine flu was between four and five. Plotted on a trend chart, the volume of posts grew exponentially, as depicted on the following chart.



The smoothed line on this curve represents the modeled equation for the Gompertz Curve. A quick linear regression of the growth portion of the curve shows a significant and high degree of correlation between the actual data and the modeled curve. Greater than 95% of the variance in the data can be accounted for with the Gompertz model.

There is an interesting kinetic parameter known as the doubling time. The doubling time represents the amount of time it takes for a population to double. It is calculated as LN(2) divided by the slope of the growth curve (the maximum growth rate).

In this example of swine flu the doubling time works out to be 16.5 hours. The practical meaning of this is that Technorati.com recorded a doubling of swine flu blog mentions every 16 1/2 hours until it ultimately peaked on 4/28 and interest began to subside.
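The doubling-time arithmetic is easy to check. In this sketch the function name is made up, and the per-hour growth rate is not taken from the fitted model; it is back-calculated from the 16.5-hour figure above simply to show the relationship in both directions.

```python
import math


def doubling_time(growth_rate):
    """Time for an exponentially growing count to double: ln(2) / rate."""
    if growth_rate <= 0:
        raise ValueError("growth rate must be positive")
    return math.log(2) / growth_rate


# Growth rate per hour back-calculated from the 16.5-hour figure in the post.
rate_per_hour = math.log(2) / 16.5
print(round(rate_per_hour, 4))                 # per-hour growth rate
print(round(doubling_time(rate_per_hour), 1))  # doubling time in hours
```

The units of the doubling time follow the units of the rate, so a slope fitted per day gives a doubling time in days.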

I would propose that different social news events and topics will grow exponentially at different rates and doubling times. Peak interest levels would also be different. Perhaps we have a way or can find a parameter that can compare one event to another. Doubling time is definitely one of these parameters.

Thursday, May 07, 2009

Signal to Noise in Social Media, Part II

A couple of days ago I wrote about how social media can be monitored to indicate when items such as news or products are "buzzing". By applying techniques of industrial engineering such as a statistical process control (SPC) chart, we can find when an issue exceeds its normal chatter or noise level and becomes a real issue or "buzz". I mentioned how Chrysler appeared to become more than chatter on May 1st the day after they announced a bankruptcy filing. We saw a similar trend in April with Swine Flu.

Using Technorati.com data, I have been able to put the following SPC chart together for Chrysler.
A few comments about interpretation.

1) This type of chart is an exponentially weighted moving average (EWMA). The moving average is shown on the yellow line.
2) The actual data from April 1st to May 5th is shown on the blue line.
3) The red bands show the upper and lower +/- 3 sigma control limits.
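For anyone who wants to build a chart like this, here is a rough sketch of the EWMA calculation using the standard textbook formulas. The smoothing weight of 0.2, the use of the first eight days as the baseline, and the function name are my assumptions, not the settings behind the chart above. The daily counts are invented for illustration, apart from the 1085 and 2635 values mentioned in this post.

```python
import math
import statistics


def ewma_chart(values, baseline_n, lam=0.2, width=3.0):
    """Return (ewma, lower, upper) per point.

    The center line and sigma are estimated from the first baseline_n values.
    """
    baseline = values[:baseline_n]
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    rows, z = [], center
    for i, x in enumerate(values, start=1):
        # Exponentially weighted moving average of the observations so far.
        z = lam * x + (1 - lam) * z
        # Time-varying limits that widen toward their steady-state value.
        spread = width * sigma * math.sqrt(
            lam / (2 - lam) * (1 - (1 - lam) ** (2 * i))
        )
        rows.append((z, center - spread, center + spread))
    return rows


# Hypothetical daily post counts: a quiet baseline followed by a jump.
posts = [480, 520, 450, 560, 500, 530, 470, 510, 1085, 900, 2635]
for day, (z, lcl, ucl) in enumerate(ewma_chart(posts, baseline_n=8), start=1):
    flag = "OUT" if z > ucl or z < lcl else ""
    print(day, round(z), round(lcl), round(ucl), flag)
```

A smaller smoothing weight makes the chart more sensitive to small sustained shifts, while a larger one makes it react faster to single large jumps.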

In the normal sense of interpretation, as long as the tracked response stays within the red bands, the chatter that is going on is normal or expected. We have no special cause or event. It seems that on average Chrysler was already netting about 400-600 posts per day in the blogosphere (as reported by Technorati.com). On April 24th something happened to indicate a result just slightly out of control. Daily posts reached 1085 that day and were slightly outside the upper control limit. Some type of significant event had occurred.

On April 23rd there was news about the US Treasury telling Chrysler to prepare for a bankruptcy filing. The increase in blog mentions was due to the nature of the news; interest in Chrysler had been piqued. A week later, on April 30th, announcements about Chrysler filing for bankruptcy hit the news. On the next day, May 1st, the blogosphere peaked at 2635 posts. At roughly five times the normal level of chatter, it was easy to see that blog writers were reacting to the news.

I am thinking this is a simple example and most people already were intelligent enough to recognize that Chrysler was in trouble. It had been discussed for months on and off in the media. What is important is that the technique here is very powerful. It could be used to detect small shifts in interest in a product or perhaps the popularity of a person or sports team.

These waves of popular interest can be tracked and taken advantage of if detected early enough.

Tuesday, May 05, 2009

Signal to Noise in Social Media, Part I

Charting social media seems like a natural extension of industrial engineering techniques that could provide real value. Take for example the idea of signal to noise ratio or the concept of control limits for control charts. Is there value in trending social media and looking for changes in activity based on signal to noise? Two recent examples come to mind for me: 1) Swine Flu and 2) Chrysler.

In late April 2009 the blogosphere went crazy discussing Swine Flu as a pandemic or outbreak that might have an effect nationwide. On April 27th posts mentioning Swine Flu peaked at near 11,000. By May 5th interest had dwindled back down to near zero levels. In the weeks prior to April 27th there was little or no mention of Swine Flu. If it was mentioned, it was just part of the routine noise that occurs in the blogs. Clearly at 11,000 posts, the signal to noise ratio was high. Any control chart with +/- 3 sigma control limits would definitely have indicated that interest in Swine Flu was out of control, at least from a statistical standpoint.
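As a rough illustration of the idea, the sketch below sets +/- 3 sigma limits from a quiet baseline period and checks a new daily count against them. The baseline counts are invented, and expressing signal to noise as the number of sigmas above the baseline mean is just one convenient assumption; other definitions are possible.

```python
import statistics


def three_sigma_limits(baseline):
    """Center line and +/- 3 sigma limits estimated from baseline counts."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return center - 3 * sigma, center, center + 3 * sigma


def signal_to_noise(value, baseline):
    """Distance of a new observation from the baseline mean, in sigma units."""
    return (value - statistics.mean(baseline)) / statistics.stdev(baseline)


# Hypothetical quiet-period daily counts, a handful of posts per day.
baseline = [4, 5, 3, 6, 4, 5, 4, 5]
lcl, center, ucl = three_sigma_limits(baseline)
print("out of control:", 11000 > ucl)
print("sigmas above baseline:", round(signal_to_noise(11000, baseline)))
```

With a baseline this quiet, a spike of 11,000 posts is thousands of sigmas out, so the interesting use of the chart is catching much smaller shifts before they become obvious.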

The question I have is, do you think there is value in monitoring key interest in subjects within blogs? Would early warning or detection of small changes in the average mention of a subject be of benefit? Is this type of advance intelligence important?

Case 2 surrounds Chrysler. In late 2008 Chrysler sought bailout funds from the US government in the form of loans. Interest in the blogs for Chrysler seemed to peak in the last weeks of December at about 1300 posts per week. Recently Chrysler filed for bankruptcy and the blogosphere went crazy. This time posts hit ~2600 per day on May 1st, the day after the filing. Like the Swine Flu, the signal to noise ratio was extremely high.

I have been using Technorati.com to find these figures. Excuse me if the dates may be a day or two off as I am interpreting from Technorati charts visually. Hopefully I can learn the API calls to cull the exact count data out. I would think that standard control chart techniques can be applied to this type of data to detect early interest in popularity of keyword searches. Although this is still somewhat of a reactive measure it could detect when a product, ballplayer, service or other element makes a move in the eyes of the public. Is there value in this idea?

The idea is to sort out the signals from the noise in the data we are receiving from blogs. Hopefully by the next blog, I will have a chart developed for us to look at or discuss.