



Announcement

Data Warehousing & Data Mining

Only by persevering with something can you truly see it through.

薛峰
2009.02.03



Blog info
Blog name: 数据仓库与数据挖掘 (Data Warehousing and Data Mining)
Total posts: 85
Comments: 14
Guestbook messages: 0
Visits: 723,495
Created: March 17, 2005




[Data Warehousing] IBM DB2 Study Notes
Online resources

Posted by 薛峰 on 2005/6/30 8:53:42

[IBM DB2 Study Notes, Part 1] [彭建军]
Note: syntax and concepts in IBM DB2 that are identical to MS SQL Server 2000 are not listed again here.

I. DB2 SQL overview
1. Schemas
1.1 A schema is a named collection of objects such as tables and views. Schemas provide a logical grouping of the objects in a database.
1.2 When an object is created in a database, the system implicitly creates its schema; a schema can also be created explicitly with CREATE SCHEMA.
1.3 An object name has two parts, schema.object_name, for example pjj.TempTable1. If the schema is not specified explicitly, the system uses the default schema (the current user ID).
2. Data types
Fixed-length string: CHAR(x), x in the range 1~254, a sequence of bytes
Variable-length string: VARCHAR(x) …




[Data Mining] Applying Data Mining to Telecom Fraud Detection
Essay, Reflections

Posted by 薛峰 on 2005/6/28 10:03:29

Abstract: Fraud is a prominent problem in the telecom sector. This paper studies the application of data mining techniques to telecom fraud detection and validates the approach on real data from a mobile operator. Fraud detection is carried out through the steps of business understanding, data understanding, data preparation, model building, and model deployment. The model-building stage uses the Kohonen neural network, a self-organizing clustering algorithm.
Keywords: data mining; fraud detection; Kohonen algorithm; CRISP-DM
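Since the abstract names the Kohonen self-organizing map as the clustering step, the following is a minimal illustrative MATLAB sketch of that idea; the map size, learning-rate schedule, and synthetic features are assumptions, not taken from the paper, and the neighbourhood update of a full SOM is omitted for brevity.

% Minimal Kohonen/SOM-style clustering sketch (illustrative assumptions only).
X      = randn(2, 500);                 % 2 x 500 toy matrix of normalized usage features
[d, n] = size(X);
K      = 16;                            % number of map units (a 4x4 grid, flattened)
W      = randn(d, K);                   % one prototype vector per unit
eta0   = 0.5;                           % initial learning rate
T      = 50;                            % training epochs
for epoch = 1:T,
    eta = eta0 * (1 - epoch/T);                     % linearly decaying learning rate
    for t = randperm(n),
        x        = X(:, t);
        dist2    = sum((W - x*ones(1,K)).^2);       % squared distance to every unit
        [m, bmu] = min(dist2);                      % best-matching unit
        W(:,bmu) = W(:,bmu) + eta*(x - W(:,bmu));   % pull the winner toward the sample
    end
end
% Subscribers mapped to sparsely populated units can be flagged for manual
% review as candidate anomalous (fraud-like) behaviour.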
1 Introduction
With the rapid development of mobile services, mobile operators' revenues keep growing. At the same time, fraud on mobile networks keeps emerging, and wireless fraud has become a serious problem for the mobile industry worldwide: it costs operators revenue and adds extra expense, which in turn lowers profits, while subscribers' legitimate interests are harmed and operators' reputations cannot be protected.
Wireless fraud can be roughly divided into four categories: (1) airtime fraud, occupying mobile call time without paying for it, which further splits into technical fraud (cloned handsets, "magic phones", and so on) and subscriber fraud (roaming abuse, misuse of supplementary services, and "friendly" fraud); (2) internal fraud, in which operator staff abuse their positions for illegal gain; (3) handset fraud, …




MATLAB Implementations of Selected Data Mining Algorithms: ID3
Essay, Reading notes

Posted by 薛峰 on 2005/6/27 14:22:13

 function D = ID3(train_features, train_targets, params, region)

% Classify using Quinlan's ID3 algorithm
% Inputs:
%   train_features - Train features
%   train_targets  - Train targets
%   params         - [Number of bins for the data, Percentage of incorrectly assigned samples at a node]
%   region         - Decision region vector: [-x x -y y number_of_points]
%
% Outputs:
%   D              - Decision surface

[Ni, M]    = size(train_features);

%Get parameters
[Nbins, inc_node] = process_params(params);
inc_node    = inc_node*M/100;

%For the decision region
N           = region(5);
mx          = ones(N,1) * linspace (region(1),region(2),N);
my          = linspace (region(3),region(4),N)' * ones(1,N);
flatxy      = [mx(:), my(:)]';

%Preprocessing
[f, t, UW, m]      = PCA(train_features, train_targets, Ni, region);
train_features  = UW * (train_features - m*ones(1,M));
flatxy          = UW * (flatxy - m*ones(1,N^2));

%First, bin the data and the decision region data
[H, binned_features]= high_histogram(train_features, Nbins, region);
[H, binned_xy]      = high_histogram(flatxy, Nbins, region);

%Build the tree recursively
disp('Building tree')
tree        = make_tree(binned_features, train_targets, inc_node, Nbins);

%Make the decision region according to the tree
disp('Building decision surface using the tree')
targets = use_tree(binned_xy, 1:N^2, tree, Nbins, unique(train_targets));

D = reshape(targets,N,N);
%END

function targets = use_tree(features, indices, tree, Nbins, Uc)
%Classify recursively using a tree

targets = zeros(1, size(features,2));

if (size(features,1) == 1),
    %Only one dimension left, so work on it
    for i = 1:Nbins,
        in = indices(find(features(indices) == i));
        if ~isempty(in),
            if isfinite(tree.child(i)),
                targets(in) = tree.child(i);
            else
                %No data was found in the training set for this bin, so choose it randomly
                n           = 1 + floor(rand(1)*length(Uc));
                targets(in) = Uc(n);
            end
        end
    end
    return
end
        
%This is not the last level of the tree, so:
%First, find the dimension we are to work on
dim = tree.split_dim;
dims= find(~ismember(1:size(features,1), dim));

%And classify according to it
for i = 1:Nbins,
    in      = indices(find(features(dim, indices) == i));
    targets = targets + use_tree(features(dims, :), in, tree.child(i), Nbins, Uc);
end
    
%END use_tree

function tree = make_tree(features, targets, inc_node, Nbins)
%Build a tree recursively

[Ni, L]     = size(features);
Uc          = unique(targets);

%When to stop: If the dimension is one or the number of examples is small
if ((Ni == 1) | (inc_node > L)),
    %Compute the children non-recursively
    for i = 1:Nbins,
        tree.split_dim  = 0;
        indices         = find(features == i);
        if ~isempty(indices),
            if (length(unique(targets(indices))) == 1),
                tree.child(i) = targets(indices(1));
            else
                H               = hist(targets(indices), Uc);
                [m, T]          = max(H);
                tree.child(i)   = Uc(T);
            end
        else
            tree.child(i)   = inf;
        end
    end
    return
end

%Compute the node's impurity (class entropy)
for i = 1:length(Uc),
    Pnode(i) = length(find(targets == Uc(i))) / L;
end
Inode = -sum(Pnode.*log(Pnode)/log(2));

%For each dimension, compute the gain ratio impurity
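%   (delta_Ib(i) is the gain-ratio criterion for splitting on dimension i:
%    the information gain, Inode - sum(Pk.*info), divided by the split
%    information, -sum(Pk.*log(eps+Pk)/log(2)); the dimension with the
%    largest value is chosen further below.)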
delta_Ib    = zeros(1, Ni);
P           = zeros(length(Uc), Nbins);
for i = 1:Ni,
    for j = 1:length(Uc),
        for k = 1:Nbins,
            indices = find((targets == Uc(j)) & (features(i,:) == k));
            P(j,k)  = length(indices);
        end
    end
    Pk          = sum(P);
    P           = P/L;
    Pk          = Pk/sum(Pk);
    info        = sum(-P.*log(eps+P)/log(2));
    delta_Ib(i) = (Inode-sum(Pk.*info))/-sum(Pk.*log(eps+Pk)/log(2));
end

%Find the dimension maximizing the gain ratio
[m, dim] = max(delta_Ib);

%Split along the 'dim' dimension
tree.split_dim = dim;
dims           = find(~ismember(1:Ni, dim));
for i = 1:Nbins,
    indices       = find(features(dim, :) == i);
    tree.child(i) = make_tree(features(dims, indices), targets(indices), inc_node, Nbins);
end
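
A hypothetical call of the listing above, assuming the companion toolbox routines it relies on (process_params, PCA, high_histogram) are on the MATLAB path; the toy data, bin count, and plotting are illustrative and not from the original post:

% Hypothetical usage sketch (illustrative only).
train_features = [randn(2,100), randn(2,100) + 3];   % 2 x 200 toy feature matrix
train_targets  = [zeros(1,100), ones(1,100)];        % binary class labels
params         = [10, 5];                            % [number of bins, inc_node percentage]
region         = [-5 8 -5 8 100];                    % [-x x -y y number_of_points]
D = ID3(train_features, train_targets, params, region);
imagesc(D); axis xy;                                 % inspect the decision surface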



[Data Mining] MATLAB Implementations of Selected Data Mining Algorithms: C4_5
Online resources, Essay

Posted by 薛峰 on 2005/6/27 14:21:09

 function D = C4_5(train_features, train_targets, inc_node, region)

% Classify using Quinlan's C4.5 algorithm
% Inputs:
%   train_features - Train features
%   train_targets  - Train targets
%   inc_node       - Percentage of incorrectly assigned samples at a node
%   region         - Decision region vector: [-x x -y y number_of_points]
%
% Outputs:
%   D              - Decision surface

%NOTE: In this implementation it is assumed that a feature vector with fewer than 10 unique values (the parameter Nu)
%is discrete, and will be treated as such. Other vectors will be treated as continuous

[Ni, M] = size(train_features);
inc_node    = inc_node*M/100;
Nu          = 10;

%For the decision region
N           = region(5);
mx          = ones(N,1) * linspace (region(1),region(2),N);
my          = linspace (region(3),region(4),N)' * ones(1,N);
flatxy      = [mx(:), my(:)]';

%Preprocessing
%[f, t, UW, m]      = PCA(train_features, train_targets, Ni, region);
%train_features  = UW * (train_features - m*ones(1,M));;
%flatxy          = UW * (flatxy - m*ones(1,N^2));;

%Find which of the input features are discrete, and discretize the corresponding
%dimension on the decision region
discrete_dim = zeros(1,Ni);
for i = 1:Ni,
   Nb = length(unique(train_features(i,:)));
   if (Nb <= Nu),
      %This is a discrete feature
      discrete_dim(i) = Nb;
      [H, flatxy(i,:)] = high_histogram(flatxy(i,:), Nb);
   end
end

%Build the tree recursively
disp('Building tree')
tree        = make_tree(train_features, train_targets, inc_node, discrete_dim, max(discrete_dim), 0);

%Make the decision region according to the tree
disp('Building decision surface using the tree')
targets = use_tree(flatxy, 1:N^2, tree, discrete_dim, unique(train_targets));

D   = reshape(targets,N,N);
%END

function targets = use_tree(features, indices, tree, discrete_dim, Uc)
%Classify recursively using a tree

targets = zeros(1, size(features,2));

if (tree.dim == 0)
   %Reached the end of the tree
   targets(indices) = tree.child;
   return
end
        
%This is not the last level of the tree, so:
%First, find the dimension we are to work on
dim = tree.dim;
dims= 1:size(features,1);

%And classify according to it
if (discrete_dim(dim) == 0),
   %Continuous feature
   in = indices(find(features(dim, indices) <= tree.split_loc));
   targets = targets + use_tree(features(dims, :), in, tree.child(1), discrete_dim(dims), Uc);
   in = indices(find(features(dim, indices) >  tree.split_loc));
   targets = targets + use_tree(features(dims, :), in, tree.child(2), discrete_dim(dims), Uc);
else
   %Discrete feature
   Uf = unique(features(dim,:));
   for i = 1:length(Uf),
      in      = indices(find(features(dim, indices) == Uf(i)));
      targets = targets + use_tree(features(dims, :), in, tree.child(i), discrete_dim(dims), Uc);
   end
end
    
%END use_tree

function tree = make_tree(features, targets, inc_node, discrete_dim, maxNbin, base)
%Build a tree recursively

[Ni, L]     = size(features);
Uc         = unique(targets);
tree.dim = 0;
%tree.child(1:maxNbin) = zeros(1,maxNbin);
tree.split_loc = inf;

if isempty(features),
   return
end

%When to stop: If the dimension is one or the number of examples is small
if ((inc_node > L) | (L == 1) | (length(Uc) == 1)),
   H = hist(targets, length(Uc));
   [m, largest] = max(H);
   tree.child = Uc(largest);
   return
end

%Compute the node's impurity (class entropy)
for i = 1:length(Uc),
    Pnode(i) = length(find(targets == Uc(i))) / L;
end
Inode = -sum(Pnode.*log(Pnode)/log(2));

%For each dimension, compute the gain ratio impurity
%This is done separately for discrete and continuous features
delta_Ib    = zeros(1, Ni);
split_loc =

(2,838 more characters follow in the full post)
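The listing is cut off above. For orientation only, a hypothetical call of the completed C4_5 function might look like the sketch below, chosen to exercise the discrete/continuous distinction described in the header comment; the data and parameters are illustrative assumptions, and high_histogram from the same toolbox must be on the path.

% Hypothetical usage sketch (illustrative only).
x_cont         = [randn(1,100), randn(1,100) + 3];       % continuous feature
x_disc         = ceil(3*rand(1,200));                    % few unique values, treated as discrete (Nu = 10)
train_features = [x_cont; x_disc];                       % 2 x 200 toy feature matrix
train_targets  = [zeros(1,100), ones(1,100)];            % binary class labels
inc_node       = 5;                                      % percent of misassigned samples tolerated at a node
region         = [-5 8 0 4 100];                         % [-x x -y y number_of_points]
D = C4_5(train_features, train_targets, inc_node, region);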



[Data Mining] Market Segmentation: A Key Weapon for Business Success
Essay, Reading notes, Reflections

Posted by 薛峰 on 2005/6/27 14:13:21

Through market research, business operators must divide the overall consumer market into a number of sub-markets that share broadly similar characteristics, according to consumers' differing wants and needs for products and their differing buying behaviours and habits.

The important role of market segmentation

1. It helps a company identify its target market. Whether the target market is chosen correctly directly determines the company's subsequent development strategies and the "starting conditions" for its growth over the following years, so a company must look for an ideal target market on the basis of thorough market segmentation. Take, for example, 江门市时尚冷气公司 (Jiangmen Shishang Air-Conditioning Company) in Jiangmen, Guangdong, a firm specializing in air-conditioning products. It was originally a general supply-and-marketing trading company dealing mainly in local produce and household sundries; because the goods carried low added value, there were many competitors, and competition was fierce, its results worsened year by year, and continuing on that course would have been unsustainable. In 1988, after carefully analysing the company's internal and external environment, its decision-makers resolved to seize the opportunity created by the shortage of air conditioners at the time: based in downtown Jiangmen and radiating out to the economically developed Pearl River Delta, the company made full use of its traditional strengths as a supply-and-marketing enterprise, worked hard to secure supply, exploited its proximity to Hong Kong and Macau to sell imported brand-name air conditioners directly, cut out intermediaries, and reduced distribution costs. Because the business direction was right and the target market well chosen, the company fully captured the target market's demand for air conditioners…

(4,580 more characters follow in the full post)





