I read the Chinese edition of this book, so these notes were originally written in Chinese.

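The book falls roughly into two parts. The first describes how the age of electrification affected the whole world, and uses that history to anticipate what may happen to human society once computing plus the Internet becomes a public utility. The second surveys today's popular Internet services and where they are heading. Overall, the first part is fairly interesting; the rest has little to recommend it, since there are already too many bestsellers on the same themes.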

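The impact and contributions of the age of electrification need no retelling, but the rise and fall of Edison, the great inventor, within it is worth thinking about:
- Edison was full of inspiration and foresight. He not only invented the incandescent lamp but also planned and built the whole ecosystem of the electrical age.
- Within that ecosystem, he supplied both the technical feasibility and a complete commercial system.
- But because he trusted direct current too much on the technical side, and was tempted by his business model (the more small power plants there were, the more generating equipment he could sell), he turned a blind eye to the advantages of alternating current in power transmission and to the trend toward large, centralized power plants.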

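Lessons to learn:
- If you design a product feature mainly because it will bring you more revenue, you are quite likely on the wrong path.
- The primary goal of product design is to solve people's real needs in the most correct way; only on that basis should you think about profit.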

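Take the recent search war between 360 and Baidu as an example. Why did Baidu's share price look so fragile under 360's attack? The main reason is that 360 kept hammering on Baidu's paid-ranking model. That model clearly damages the experience of using the search engine and undermines its moral standing.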

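Part of the second half of the book is basically a mash-up of Long Tail and Micro Trend. A few other points worth noting:
- Human society is gradually entering the era of the gift economy, in which people's motives for an activity are not entirely determined by economic gain; this will become a source of cheap data for Internet service providers.
- In the Internet age, everyone's traces will have nowhere to hide; while enjoying the Internet's convenience, we face a huge risk of privacy leakage.
- The combination of the human brain and intelligent Internet systems will lead humanity into an even more exciting future.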

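Finally, the author has a passage on the tide of technological change that really resonates (translated back from the Chinese edition, so the wording is approximate): "Every technological advance involves two generations, and its full force and impact appear only when the second generation of the new era has fully grown up and pushed the elders who created the technology into the dustbin of history. That is how technological progress works: everything we use today seems to be taken for granted. By the end of this century, humanity will no longer hold any memory of life without computers and the Internet, and we will be the ones who take that last memory with us."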

 

 

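Red Oceans and Blue Oceans
The market universe is made up of two kinds of oceans: red oceans and blue oceans.
Red oceans represent all the industries in existence today, that is, the known market space; blue oceans represent the industries not yet in existence, that is, the unknown market space.
Different markets call for different business strategies.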

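Red Ocean Strategy
Take a share of an existing market. The best-known approach is Michael Porter's competitive strategy, which rests on two main theoretical frameworks.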

Five Forces Analysis
- Bargaining Power of Suppliers
- Bargaining Power of Buyers
- Threat of New Entrants
- Threat of Substitutes
- Rivalry Among Existing Competitors

Three Generic Strategies
- Overall Cost Leadership
- Differentiation
- Focus

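There are not many strategies for winning in a red ocean market: push costs lower than your competitors'; or give your products and services a distinctive character, so that customers feel you offer more value than other competitors; or dedicate the company to serving a particular market segment, a particular product category, or a particular geographic area.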

Blue Ocean Strategy
- Open up new market space and avoid a bloody fight with existing competitors

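Blue ocean strategy mainly comprises an analytical framework for value innovation, a methodology for redefining market boundaries, and a sequence of steps for analyzing and formulating strategy.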

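The analytical framework mainly consists of the strategy canvas and the four actions framework:
- Strategy canvas: made up of the main competing factors in a market and the scores of that market's products/services on each factor. Equivalently, a strategy canvas is the set of value curves of the products and services in a market.
- Four actions framework: eliminate factors that add no value, reduce factors whose value is weakening, raise factors that are growing in importance, and create factors that do not yet exist.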
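
As a rough illustration, a value curve can be modeled as a mapping from competing factors to scores, and the four actions as edits to that mapping. This is only a sketch: the function name, the factor names, and the 0 to 5 scores are made up for illustration and do not come from the book.

```python
def apply_four_actions(curve, eliminate=(), reduce=None, raise_=None, create=None):
    """Return a new value curve (factor -> score) after the four actions.

    curve     : dict of factor scores for the industry baseline
    eliminate : factors to drop entirely
    reduce    : dict of factor -> lower score
    raise_    : dict of factor -> higher score
    create    : dict of new factor -> score (must not exist in the baseline)
    """
    reduce, raise_, create = reduce or {}, raise_ or {}, create or {}
    # Eliminate: keep only the factors that are not dropped.
    new_curve = {f: s for f, s in curve.items() if f not in eliminate}
    for factor, score in reduce.items():
        if score >= curve[factor]:
            raise ValueError(f"reduce must lower the score of {factor!r}")
        new_curve[factor] = score
    for factor, score in raise_.items():
        if score <= curve[factor]:
            raise ValueError(f"raise must increase the score of {factor!r}")
        new_curve[factor] = score
    for factor, score in create.items():
        if factor in curve:
            raise ValueError(f"{factor!r} already exists in the baseline")
        new_curve[factor] = score
    return new_curve


# Hypothetical industry baseline, scored 0 (none) to 5 (high).
baseline = {"price": 4, "feature count": 5, "sales force": 4, "ease of use": 2}

blue_ocean = apply_four_actions(
    baseline,
    eliminate=["sales force"],
    reduce={"price": 2, "feature count": 3},
    raise_={"ease of use": 5},
    create={"self-service onboarding": 4},
)
print(blue_ocean)
```

Plotting `baseline` and `blue_ocean` over the same factor axis would give the two value curves of a strategy canvas.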

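When working out a value curve, the criteria for a good curve are:

- Focus: this is how you lower cost and bring out your strengths
- Divergence: this is how you stand apart from similar products
- A compelling theme: this is how you deliver real value to buyers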

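Every product on the strategy canvas has a value curve. The process of discovering and creating a value curve that fits blue ocean strategy is also called value innovation: create value for both producer and consumer in a newly defined market, ignore competitors in the existing market, and focus on yourself and your product or service.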

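Reconstructing market boundaries means changing the scope of the existing market and finding more target customers. The main methods are:
- Cross alternative markets and become a substitute for the alternative products
- Cross market groups (high end / low end) and combine features aimed at different market segments
- Cross the chain of buyers and reach the real consumer directly
- Cross complementary products and services and integrate a strong ecosystem
- Cross functional and sensory boundaries, adding or removing functional and sensory traits of the existing offering
- Cross time, embracing and even helping to create the trends that are coming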

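Strategy Formulation Steps

A typical cycle of strategy evaluation and formulation: measure buyer utility -> set the strategic price -> assess the target cost -> address adoption hurdles
- A blue ocean strategy must offer target customers exceptional utility, creating market demand and giving customers a reason to buy.
- Set a reasonable price, so that target customers get good value for money and real transactions take place. Pricing factors: how exceptional the utility is, the barriers to competition, and the current market price range.
- Derive the target cost from the chosen price and the desired profit margin, then work hard to assess whether that target cost is achievable, rather than the other way round.
- Consider resistance to the new strategy from employees, partners, and the public, and prepare responses in advance to ensure the strategy is executed smoothly.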
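
Deriving the target cost from the price is simple arithmetic: target cost = strategic price × (1 − desired profit margin). A minimal sketch follows; the function names and numbers are made up, and it assumes the margin is expressed as a fraction of the price.

```python
def target_cost(strategic_price, profit_margin):
    """Cost ceiling implied by a fixed price and a desired margin on price."""
    if not 0 <= profit_margin < 1:
        raise ValueError("profit_margin must be in [0, 1)")
    return strategic_price * (1 - profit_margin)


def is_feasible(estimated_cost, strategic_price, profit_margin):
    """True if the estimated unit cost fits under the target cost."""
    return estimated_cost <= target_cost(strategic_price, profit_margin)


# Hypothetical numbers: price fixed at 100, desired margin 25% of price.
print(target_cost(100, 0.25))      # 75.0
print(is_feasible(80, 100, 0.25))  # False: cost must come down, not price go up
```

The price and margin are the inputs and the cost is the output, which is the direction of reasoning the step calls for.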

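Pricing in blue ocean strategy is usually a strategic act, and should not be compromised because of cost. Common ways to cut cost: streamlining operations, technical innovation, and partnering with suppliers. If cost really cannot be cut, pricing innovation is another option, for example: selling the right to use instead of ownership; taking equity in the customer instead of cash; turning spot goods into futures-style transactions.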

 

 

Managing a large-scale data center automatically, without much human involvement, is always a challenging task. Industry giants such as Google and Microsoft are pioneers in this area, and very little information has leaked about how they handle such problems. But in 2007, Michael Isard of Microsoft Research wrote a paper entitled "Autopilot: Automatic Data Center Management", which describes the technology that the Windows Live and Live Search services have used to manage their server farms. This is a great opportunity to look at how an industry giant manages tens of thousands of machines using software.

Design Principles

- Fault tolerance: any component can fail at any time, so the system must be reliable enough to continue automatically with some proportion of its computers powered down or misbehaving.
- Simplicity: simplicity is as important as fault tolerance when building a large-scale, reliable, maintainable system. Avoid unnecessary optimization and unnecessary generality.

Data center layout

A typical "application" rack might contain 20 identical multi-core computers, each with 4 direct-attached hard drives. Also in the rack is a simple switch allowing the computers to communicate locally with other computers in the rack, and via a switch hierarchy with the rest of the data center.

Finally each computer has a management interface, either built in to the server design or accessed via a rack-mounted serial concentrator.

The set of computers managed by a single instance of Autopilot is called a cluster.

Autopilot architecture

Autopilot consists of three subsystems:
- Hardware management: maintains the state of machines, switches, and routers, repairs errors automatically, provisions operating systems, and so on.
- Deployment: automatically deploys applications and data to specified machines in a data center.
- Monitoring: monitors the state of devices and services inside the data center, collects performance counters, and presents them in a user-friendly UI.

Hardware Management
- This is the main responsibility of the Device Manager, which maintains replicated state for each device in the data center
- It decides whether to reboot, re-image, or retire a physical machine, switch, or router
- It periodically discovers new machines through the special management interface, either built in to the server design or accessed via a rack-mounted serial concentrator
- It automates the OS installation process through the Provisioning Service
- It automates the error repair process using the Repair Service
- It collects device state from the various Watchdog Services
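
The reboot / re-image / retire decision can be pictured as an escalation ladder: try the cheapest repair first and escalate if the device keeps failing. The sketch below is my own simplification and not the Device Manager's actual policy; the `RepairAction` names and the rule of one step per recent failure are assumptions.

```python
from enum import Enum


class RepairAction(Enum):
    DO_NOTHING = 0
    REBOOT = 1
    REIMAGE = 2
    RETIRE = 3


def choose_repair(recent_failures):
    """Pick a repair action from the number of recent failures of a device.

    Assumed policy: 0 failures -> nothing, 1 -> reboot, 2 -> re-image,
    3 or more -> retire the device so a human can look at it.
    """
    if recent_failures < 0:
        raise ValueError("recent_failures must be non-negative")
    ladder = [
        RepairAction.DO_NOTHING,
        RepairAction.REBOOT,
        RepairAction.REIMAGE,
        RepairAction.RETIRE,
    ]
    # Clamp to the last rung so repeated failures keep mapping to RETIRE.
    return ladder[min(recent_failures, len(ladder) - 1)]


for failures in range(5):
    print(failures, choose_repair(failures).name)
```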

Deployment
- Each machine is assigned a machine function, which indicates what role it plays and what kinds of services will run on it
- Each machine is also assigned to a scale unit, a collection of machines that serves as the unit for application/OS updates
- Each machine is responsible for running a list of application/Autopilot services, and this list is stored as a service manifest file. Multiple versions of the manifest file can be stored on a machine, but only one is active; the others are kept so they can be switched to active, or used for rollback when an upgrade fails
- The Device Manager (DM) maintains the list of manifest files for each machine in the cluster, along with the corresponding active version
- The Deployment Service is a multi-node service that stores all the application/data files listed in the service manifests. These files are synced from an external build system.
- An Autopilot operator triggers a new code deployment with a single command to the Device Manager. DM then updates the service manifests of the specified machines accordingly and kicks each machine to start syncing bits from the Deployment Service and running them. Each machine in the cluster then syncs the manifest file, downloads the specified application/data to local disk, and starts them.
- In the normal case, each machine periodically asks DM which manifests should be on its local disk. It fetches them from the Deployment Service if any needed manifest files are missing
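
The manifest handling above can be sketched as a small state machine per machine: several manifest versions on disk, one active, and the previous one kept for rollback. This is a toy model under my own assumptions, not Autopilot code; `Machine`, `reconcile`, and the dict standing in for the Deployment Service are hypothetical names, and activating a manifest right after syncing it is a simplification.

```python
class Machine:
    """Keeps several manifest versions on local disk; at most one is active."""

    def __init__(self, name):
        self.name = name
        self.manifests = {}   # version -> list of service names
        self.active = None    # currently active version
        self.previous = None  # last active version, kept for rollback

    def sync(self, version, services):
        """Store a manifest locally, as if fetched from the deployment service."""
        self.manifests[version] = list(services)

    def activate(self, version):
        if version not in self.manifests:
            raise KeyError(f"manifest {version!r} is not on local disk")
        self.previous, self.active = self.active, version

    def rollback(self):
        if self.previous is None:
            raise RuntimeError("no earlier manifest to roll back to")
        self.active, self.previous = self.previous, None


def reconcile(machine, wanted_version, deployment_service):
    """Periodic check: fetch the manifest DM expects if missing, then activate it."""
    if wanted_version not in machine.manifests:
        machine.sync(wanted_version, deployment_service[wanted_version])
    if machine.active != wanted_version:
        machine.activate(wanted_version)


# Hypothetical manifests held by the deployment service.
deployment_service = {"v1": ["frontend"], "v2": ["frontend", "indexer"]}

m = Machine("m001")
reconcile(m, "v1", deployment_service)  # initial deployment
reconcile(m, "v2", deployment_service)  # operator rolls out v2
m.rollback()                            # upgrade failed, go back
print(m.active)                         # v1
```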

Monitoring
- Watchdogs constantly probe the status of other services/machines and report it back to the Device Manager. Autopilot provides some system-wide watchdogs, but application developers can build their own as long as these services know how to talk to DM about device status
- Performance counters are used to record the instantaneous state of components, for example a time-weighted average of the number of requests per second being processed by a particular server.
- The Collection Service forms a distributed collection and aggregation tree for performance counters. It can generate a centralized view of the current state of the cluster's performance counters with a latency of a few seconds.
- All collected information is stored in a central SQL Server database for fast and complex querying by end users. This data is exposed to application developers and operators through an HTTP-based service called the Cockpit service.
- Besides giving a global view of the status of the data center, Cockpit is also responsible for access to some resources (for example, application/data/log files)
- Predefined status queries and abnormal results are combined to form an alert service. It can send email and even make phone calls when critical situations happen.
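
To make the watchdog and alert flow concrete, here is a minimal sketch of how several watchdog reports for one device might be combined and turned into an alert list. The three status levels, the "worst report wins" rule, and all names below are assumptions for illustration, not details confirmed by these notes.

```python
from enum import IntEnum


class Status(IntEnum):
    OK = 0
    WARNING = 1
    ERROR = 2


def device_status(reports):
    """Combine watchdog reports for one device: the worst report wins.

    reports maps watchdog name -> Status; a device with no reports counts as OK.
    """
    return max(reports.values(), default=Status.OK)


def devices_needing_alert(all_reports, threshold=Status.ERROR):
    """Return sorted device names whose combined status reaches the threshold."""
    return sorted(
        device
        for device, reports in all_reports.items()
        if device_status(reports) >= threshold
    )


# Hypothetical watchdog reports, keyed by device and then by watchdog.
reports = {
    "m001": {"ping": Status.OK, "disk": Status.OK},
    "m002": {"ping": Status.OK, "disk": Status.ERROR},
    "m003": {"ping": Status.WARNING},
}
print(devices_needing_alert(reports))                  # ['m002']
print(devices_needing_alert(reports, Status.WARNING))  # ['m002', 'm003']
```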

Reference

http://research.microsoft.com/pubs/64604/osr2007.pdf

http://www.25hoursaday.com/weblog/2008/08/11/ManagingLargeWebServerFarmsMicrosoftsAutoPilot.aspx

 

The presentation is divided into 5 parts:
- WeChat History
- On User
- On Requirement
- On Design
- On Interaction
Highlights on User
- People are lazy: let them do and click less to reach a goal
- People like fashion: do something really cool to attract them
- People lack patience: do not make them read manuals or tips
- People's time is fragmented: do not give them tasks that need a long stretch of continuous time
- People get stupid in a crowd: treat them as somewhat stupid, without judging them too much
- People are emotional: they seek inner satisfaction and a sense of existence
- People like uncertainty: they are very curious about the unknown
- People are social animals: they want to know more people
- Know your users from a psychological perspective
Highlights on Requirement
- A product is designed to satisfy some desire that lives in people's hearts and daily lives
- Satisfy users; don't put too much moral judgment into your product design
- Distill and abstract the feedback you get from end users; don't just do literally what users tell you
- Try to get to know your target users through weibo, forums, etc.
- Revolutionary products emerge when society changes
- Different people usually share some common requirements; those are the most important things to work on
- Associate feature requirements with psychological desires; people are emotional
- Think at large scale and in terms of massive groups for social products/features
- Focus on a few vital scenarios and ignore the trivial stuff
- Polls and surveys can only help you improve existing features; they can't help you with new products/features
- Feature requirements come from solving problems for you and your friends
Highlights on Design
- Evolve your product gradually: you can't design a perfect product on the first try, and every product has its own life cycle
- Products with clear DNA survive longer
- Design the product structure first, and then focus on details
- Categorize: make things clean and clear
- Love abstraction: make things simple and easy
- Design from scenarios, not from a feature list
- Be careful about over-design
- Drop features that don't excite you and your users
- Responsiveness is the king of user experience
- Ship features gradually; don't move too fast or change too much in one step
- Give users the right to choose: core + plugins
- Respect your users: protect their privacy, save their temporary input, and sign broadcast messages with a real name, not “system administrator”
- One product for all, not one version per region
- Design for the user: the user plays the leading role, not the design itself
- Make things as natural as possible; don't make people think
- Hide technology from ordinary users
- Focus: less is more
Highlights on Interaction
- UI serves the features
- Make it simple and clean
- Each screen has its own topic
- Hide numbers
Some Comments
- He overemphasizes the importance of the product manager. Most of the time, whether a product succeeds (especially in China) depends on which product you choose to build and which platform you can leverage. For a social product, existing user data and connections are the most important thing.
- The product manager is not God. God determines everything, but a product manager should design the product as users desire, explicitly or implicitly.
- There is too much criticism of competitors' product designs, yet those features are ones that I, as an ordinary user, think WeChat should add.
- WeChat is the most successful product in the market, but what is the real reason? Is it the different feature design? I think the only reason is that it is backed by Tencent, with its large number of QQ users and its binding to QQ friends.
- He said a lot about avoiding “over design”, but also talked a great deal about active design.
- The presentation lacks something like “governing by non-action” (无为而治): whether a product succeeds, and what the final running system looks like, is determined not only by how the product manager designs it but also by how people interact with it.
- He should also thank the QQ user data, the competitors, the creator of Kik, and the great era of the mobile Internet.
 

Problem Scale Today
- 550M active users
- Tens of millions of peak online users
- Billions of daily page views
- Petabyte-scale UGC data
- 100B daily requests

Qzone 1.0 – 3.0 (0 ~ 1M online users, 2004 – 2006)

  • Architecture
    • Special Windows client (embedded HTML)
    • Apache + Cache + MySQL
      * App/CGI calls different data services to build a result page for each user request
    • One service cluster per ISP
      * Users from Telecom/Netcom are served by different dedicated servers
      * App calls data services within the same ISP
  • Problems (v1/v2)
    • Special client -> hard to debug
    • Web server is not scalable
    • 30~40 nodes, at most 500k online users
  • Solutions (v3)
    • Rich client
      • Move some logic from server to client
      • Client is Ajax based; server logic is simplified
    • Dynamic/static separation
      • Static data is hosted by the lightweight web server qHttpd
      • 100x performance improvement
    • Web server optimization
      • Replace Apache with qzHttp for dynamic logic
      • 3x performance improvement
    • Main page caching
      • Render elements of the main page as static content and cache them
      • Elements are updated periodically or on demand
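
The main page caching idea can be sketched as a small cache of rendered page elements that re-renders an element when it expires or when it is explicitly invalidated. This is an illustration only: `ElementCache` and its parameters are hypothetical names, and lazy TTL expiry stands in here for the periodic refresh mentioned in the talk.

```python
import time


class ElementCache:
    """Caches rendered page elements; refreshes on expiry or on demand."""

    def __init__(self, render, ttl_seconds, clock=time.monotonic):
        self._render = render    # function: element name -> rendered HTML
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the example is deterministic
        self._entries = {}       # name -> (rendered_html, expires_at)

    def get(self, name):
        entry = self._entries.get(name)
        now = self._clock()
        if entry is None or now >= entry[1]:
            html = self._render(name)
            self._entries[name] = (html, now + self._ttl)
            return html
        return entry[0]

    def invalidate(self, name):
        """On-demand update: drop the entry so the next get() re-renders it."""
        self._entries.pop(name, None)


calls = []

def render(name):
    calls.append(name)
    return f"<div>{name} v{len(calls)}</div>"

now = [0.0]  # fake clock, advanced by hand
cache = ElementCache(render, ttl_seconds=60, clock=lambda: now[0])

first = cache.get("blog_list")   # rendered
second = cache.get("blog_list")  # served from cache
now[0] = 61.0
third = cache.get("blog_list")   # expired, rendered again
cache.invalidate("blog_list")
fourth = cache.get("blog_list")  # rendered again on demand
print(first, second, third, fourth, sep="\n")
```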

Qzone 4.0-5.0 (1M ~ 10M online users)

  • ISP separation problem: dynamic data
    • All dynamic services are hosted within one ISP
    • Other ISPs work as proxies that call these services
      • Dedicated network connection between proxy and service
    • Users in other ISPs don't call services in another ISP directly
  • ISP separation problem: static data
    • Static:Dynamic ~ 10:1, so adopt a CDN solution
    • Redirect static requests to an ISP-specific static data server according to client IP information
      • Done by Qzone app logic using client IP info
      • Previously DNS was used for redirection, which caused lots of problems
      • Due to local DNS setting problems
  • Improve user experience
    • Improve availability of critical services
      • Replicated core services
    • Provide lossy service for non-critical services
      • Skip a service if it times out
      • Can also use a default value if it fails or times out
    • Fault-tolerant design from backend service to client script
      • Default value at client
      • LVS for Qzone web servers
      • L5 (F5?) for internal critical services
    • Control the time-out for the whole request processing
      • A kind of real-time scheduling algorithm
  • Incremental release
    • Release new features to end users from small scope to larger scope
    • Team-internal dogfood
    • Whitelisted (invited) user test
    • Company-wide dogfood
    • VIP external users
    • Roll out globally
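
The "lossy service" idea in the list above, combined with a time-out for the whole request, can be sketched as follows: call all non-critical services in parallel under one shared deadline, and fall back to a default value for any service that fails or is too slow. This is a sketch under my own assumptions, not Qzone's implementation; `render_page` and the service names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def render_page(services, budget_seconds):
    """Call every service in parallel under one shared time budget.

    services maps name -> (callable, default). A service that raises or
    misses the deadline contributes its default instead of blocking the page.
    """
    deadline = time.monotonic() + budget_seconds
    executor = ThreadPoolExecutor(max_workers=max(1, len(services)))
    futures = {name: executor.submit(fn) for name, (fn, _) in services.items()}
    page = {}
    for name, future in futures.items():
        # Each wait only gets what is left of the whole-request budget.
        remaining = max(0.0, deadline - time.monotonic())
        try:
            page[name] = future.result(timeout=remaining)
        except Exception:  # timeout or service error: degrade, don't fail
            page[name] = services[name][1]
    executor.shutdown(wait=False)  # don't block the response on slow calls
    return page


def slow_visitor_count():
    time.sleep(1.0)  # simulates an overloaded backend
    return 12345

def broken_feed():
    raise RuntimeError("feed backend unavailable")

services = {
    "profile": (lambda: "Alice", "unknown user"),
    "visitor_count": (slow_visitor_count, 0),
    "feed": (broken_feed, []),
}
page = render_page(services, budget_seconds=0.2)
print(page)
```

A critical service would not get a silent default like this; the notes suggest replication for those instead.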

Qzone 6.0+ ( ~100M online users)

  • Open platform
    • App/platform separation
    • iframe-based app model
    • App dev/test/deploy is totally separated from the Qzone platform
    • Separation of concerns and parallel evolution paths
  • GEO replication – handle IDC failure
    • One IDC for writes
    • Multiple IDCs for reads
    • Dedicated synchronization protocol
  • Monitoring
    • Bandwidth/latency/error monitoring
    • Problem locating
Comments

  1. The content is very general, without many details
  2. It does not touch the core problem: how to scale, and how to partition data at such a large scale
  3. Writing to a single IDC will cause a service availability problem in a disaster unless reconfiguration is supported

Reference

Article: http://www.infoq.com/cn/articles/qzone-architecture
Video: http://djt.open.qq.com/topic-Shenzhen_Qzone.html
Slides: http://djt.qq.com/article-232-1.html

 

 