Since data centers adopted advanced technologies such as cloud computing and virtualization, they have changed shape almost overnight. These technologies have significantly improved data center efficiency and brought many benefits. However, everything has two sides: while we enjoy the benefits of the new technologies, they also complicate data center operations and maintenance. The number, scale, and complexity of managed objects have grown exponentially, and the traditional approach of manual intervention and hands-on, nanny-style monitoring and troubleshooting can no longer keep up. In public clouds and large private clouds, for example, the number of servers often reaches tens of thousands, hundreds of thousands, or even millions, and the number of cloud service systems and tenant workloads reaches the millions or tens of millions, so relying on manual maintenance is unrealistic. An automated, intelligent operations management model must be introduced, raising efficiency from a handful of servers per person to thousands of servers per person. Operations and maintenance management must not become a stumbling block to cloud development in the data center; it has to keep pace with the data center's development. This article introduces some new techniques in modern operations and maintenance.
Automatic fault repair mechanism replacing manual repair.
Data centers inevitably develop faults, and identifying them by hand is not only slow but also prone to misjudgment. This identification is better left to software. First, build a fault pattern library: accumulate over the long term the faults that have occurred or may occur, together with how to anticipate and recognize them, and keep the library up to date by entering new fault types and experience as they appear. Second, give the software the methods for judging device faults and let it judge automatically: the software compares the operating parameters collected from each piece of data center equipment with the parameters stored in the fault pattern library, and if they match, it concludes that the data center has a fault. Finally, the data center can alert the operations staff, or the software can carry out a one-click repair. How much this helps depends on the business and on how rich the data center's accumulated fault experience is. If a repair action is wrong, it may cause a secondary fault and bring even greater losses, so the repair mechanism must be used with caution: automatic repair is not recommended for non-emergency business faults, which should be repaired manually by staff after confirmation. In fact, the introduction of cloud computing has made automatic fault detection and repair harder. Applications and services are decoupled from the physical hardware and form a purely software virtual world, and the complexity of the virtual systems makes faults difficult to identify and locate, which poses a great challenge to automating fault repair in place of manual work. Nevertheless, automating data center operations is the inevitable path, because no rapidly expanding data center can bear excessive labor costs.
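The three steps above (a fault pattern library, automatic comparison, then alert or repair) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the metric names (`fan_rpm`, `temp_c`, `disk_used_pct`), the thresholds, the two library entries, and the repair labels are all hypothetical and do not come from any real product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class FaultPattern:
    name: str
    matches: Callable[[Metrics], bool]  # True when the metrics fit this fault
    emergency: bool                     # only emergency faults may be auto-repaired
    repair_action: str                  # label of the repair procedure


# Hypothetical fault pattern library. Real entries would be accumulated
# from operating experience and updated as new fault types appear.
FAULT_LIBRARY: List[FaultPattern] = [
    FaultPattern(
        name="fan_failure",
        matches=lambda m: m.get("fan_rpm", 1.0) == 0 and m.get("temp_c", 0.0) > 80,
        emergency=True,
        repair_action="migrate_workloads_and_power_off",
    ),
    FaultPattern(
        name="disk_nearly_full",
        matches=lambda m: m.get("disk_used_pct", 0.0) > 95,
        emergency=False,
        repair_action="clean_temp_files",
    ),
]


def diagnose(metrics: Metrics) -> List[FaultPattern]:
    """Compare collected parameters against every pattern in the library."""
    return [p for p in FAULT_LIBRARY if p.matches(metrics)]


def decide(pattern: FaultPattern) -> str:
    """Auto-repair emergencies; leave everything else to staff confirmation."""
    if pattern.emergency:
        return f"auto-repair: {pattern.repair_action}"
    return f"alert staff, await confirmation before: {pattern.repair_action}"
```

For example, `diagnose({"disk_used_pct": 97.0})` matches only the `disk_nearly_full` entry, and `decide` then returns an alert instead of a repair, reflecting the caution about automatic repair of non-emergency faults.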
Centralized management and control of log and monitoring information.
In a traditional data center, the logs and monitoring information of software and hardware systems are usually scattered and isolated, with no automatic correlation to businesses and users. When a fault occurs, staff may even have to log in to each device one by one, which is inefficient. Some data centers have deployed network management systems and log servers but still require manual inspection. When hundreds of thousands of devices output logs at the same time, the volume of data is far too large to check by hand, yet the information still has to be analyzed and judged. Many data centers build a cloud-based operations management platform precisely to process this mass of data in a unified way. The platform still works from judgment conditions set in advance and raises an alarm promptly when it finds logs that deviate from the norm. These conditions ignore device log alarms as such and consider only log information that affects the business. Special fault judgment conditions are designed, confirmed as valid with the various equipment vendors, and then deployed on the cloud platform. The cloud platform is powerful, but diagnosing only from the logs that devices output on their own initiative is not enough. The platform can also actively collect monitoring information from any part of the data center. This real-time information reflects the overall operating state of the data center, and as soon as an anomaly appears or a parameter value changes, an alarm is triggered and output.
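As a rough sketch of the two mechanisms described here, the Python below filters centrally collected log lines with predefined business-impact conditions and checks actively polled metrics against allowed ranges. The rule names, regular expressions, metric names, and limits are hypothetical placeholders. In practice such conditions would be agreed with the equipment vendors, as noted above.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Hypothetical judgment conditions: only business-impacting messages are
# listed, so routine device chatter is ignored.
BUSINESS_IMPACT_RULES: List[Tuple[str, re.Pattern]] = [
    ("link_down", re.compile(r"interface \S+ .*down", re.IGNORECASE)),
    ("vm_crash", re.compile(r"vm \S+ (crashed|terminated unexpectedly)", re.IGNORECASE)),
]

# Hypothetical allowed ranges (low, high) for actively collected metrics.
METRIC_LIMITS: Dict[str, Tuple[float, float]] = {
    "cpu_pct": (0.0, 90.0),
    "inlet_temp_c": (10.0, 35.0),
}


def scan_logs(lines: Iterable[str]) -> List[str]:
    """Return an alarm for every log line that matches a predefined rule."""
    alarms = []
    for line in lines:
        for name, pattern in BUSINESS_IMPACT_RULES:
            if pattern.search(line):
                alarms.append(f"{name}: {line.strip()}")
                break  # one alarm per line is enough
    return alarms


def check_metrics(sample: Dict[str, float]) -> List[str]:
    """Return an alarm for every polled metric outside its allowed range."""
    alarms = []
    for key, (low, high) in METRIC_LIMITS.items():
        value = sample.get(key)
        if value is not None and not (low <= value <= high):
            alarms.append(f"{key} out of range: {value}")
    return alarms
```

Keeping the rules in data rather than in code means new judgment conditions can be deployed to the platform without changing the scanning logic.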
Machine learning mechanism for big data.
In a traditional data center, fault diagnosis, repair advice, and handling rely mainly on the logs and monitoring information collected by the cloud platform, interpreted through the long-accumulated experience of operations staff. But human behavior is the least reliable element, and experience is often wrong. A machine is far less likely to go wrong: as long as it is given enough information to learn from, it can make the right judgment. Master has been very popular recently. Master is a Go-playing program that, in a recent series of games, recorded 60 wins and 1 draw, and even the draw was awarded by the system only because of a dropped connection. The players it defeated include top Go masters such as Nie Weiping. This shows that, given enough learning time, a machine can far surpass human ability. Operations management can also adopt machine learning: by analyzing the mass of data center operations data with big data modeling, it can automatically and intelligently uncover higher-value fault patterns and system optimizations that lie outside the knowledge of operations staff, further improving operational efficiency. With big data and machine learning, large-scale operational performance and fault behavior can be put through scenario analysis, trend prediction, and fault regression and identification, improving the machine's capacity for automated operations. Eventually its accuracy will exceed that of manual operations: just as with programs playing Go, machine-run data center operations will in the end far outperform human ones. People only need to study how to get these machines to learn correctly and learn well.
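To make the idea of learning from operations data concrete, here is a deliberately simple statistical stand-in, not a full machine learning system: a baseline is learned from historical readings of one metric, and new readings that deviate strongly from it are flagged. The class name `BaselineModel`, the threshold of three standard deviations, and the sample values in the usage note are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import List


class BaselineModel:
    """Learns the normal range of one metric from its history."""

    def __init__(self, threshold: float = 3.0) -> None:
        self.threshold = threshold  # allowed deviation, in standard deviations
        self.mu = 0.0
        self.sigma = 0.0

    def fit(self, history: List[float]) -> None:
        """Estimate mean and spread from past readings (needs at least two)."""
        self.mu = mean(history)
        self.sigma = stdev(history)

    def is_anomaly(self, value: float) -> bool:
        """Flag a reading that lies too far from the learned baseline."""
        if self.sigma == 0:
            return value != self.mu
        return abs(value - self.mu) / self.sigma > self.threshold
```

Fitted on CPU readings such as `[50, 52, 49, 51, 50, 48, 51, 49]`, the model treats 51 as normal and 75 as anomalous without anyone writing a threshold by hand. A production system would use richer models and many correlated metrics, but the principle of deriving judgment conditions from data instead of staff experience is the same.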
Clearly, the new automated operations and maintenance technology of cloud data centers has two main features: automation and self-learning. Machines learn by themselves and complete data center operation and repair automatically. Although future data centers will be larger and more complex, operations management should become simpler through automation. Removing human factors from data center operations would let the data center form a complete autonomous system and become a truly unmanned data center. Of course, automated data center operations still have a long way to go, and for now no data center can truly do without human participation. This resembles the development of self-driving cars: the technology is complex and would completely change the existing way of life, so it takes a long time for people to accept it. Automated operations and maintenance technology for data centers is promising but not yet mature enough. Many people are taking a wait-and-see attitude and hope that the technology will improve quickly.